Preface



Machine learning methods that assign labels to examples are essen- 
tial in many application areas including speech recognition, image or 
text classification, or bioinformatics problems. Such application areas 
nevertheless pose specific challenges for classification methods. For in- 
stance, we need to appropriately represent or model the examples to be 
classified such as documents or sequences. Moreover, the class labels we 
wish to assign to the examples are often rather abstract such as topics 
in document classification and therefore lead to diverse class conditional 
populations that are difficult to separate effectively. 

Machine learning approaches for these types of classification problems 
have generally fallen into two major categories - generative or discrimi- 
native - depending primarily on the estimation criterion that is used for 
adjusting the parameters and/or structure of the classification method. 
Generative approaches rely on a full structured joint probability distri- 
bution over the examples and the labels. The models in this context are 
typically cast in the language of graphical models such as Bayesian net- 
works. The joint modeling perspective offers several attractive features 
including the ability to deal effectively with missing values or encode 
prior knowledge about the structure of the problem in a very direct way. 
Discriminative methods such as support vector machines or boosting al- 
gorithms, on the other hand, focus only on the conditional relation of 
a label given the example. Their parameterized decision boundaries are 
optimized directly according to the classification objective, encouraging 
a large margin separation of the classes. When applicable, they often 
lead to robust and highly accurate classifiers. 

This monograph explores new ways of combining these largely com- 
plementary approaches in order to address some of the key challenges in 
applied contexts. Building upon extensions of the standard maximum 
entropy estimation framework and the expectation-maximization algorithm,
the monograph provides a discriminative large margin estimation
framework for a large array of popular generative models. The text
also includes contributions in other related areas such as feature selection.
While requiring some prior knowledge of support vector machines
and the associated learning theory, along with a working knowledge of 
graphical models, the monograph links these areas in a clear manner and 
provides a useful set of tools, results, and concepts for many important 
problems. 



Tommi S. Jaakkola 




This monograph is 
dedicated to my family. 
Chapter 1



INTRODUCTION 



It is not knowledge, but the act of learning, ...
which grants the greatest enjoyment. 

Karl Friedrich Gauss, 1808. 



The objective of this monograph is to unite two powerful yet different 
paradigms in machine learning: generative and discriminative. Genera- 
tive learning approaches such as Bayesian networks are at the heart of 
many pattern recognition, artificial intelligence and perception systems. 
These provide a rich framework for imposing structure and prior knowl- 
edge to learn a useful and detailed model of a phenomenon. Yet recent 
progress in discriminative learning, which includes the currently popu- 
lar support vector machine approaches, has demonstrated that superior 
performance can be obtained by avoiding generative modeling and focusing
only on the particular task the machine has to solve. The gap
between these two prevailing methods raises the question: is there a
powerful connection between generative and discriminative learning that 
combines the complementary strengths of the two approaches? In this 
text, we undertake the challenge of building such a bridge and explicate 
a common formalism that spans both schools of thought. 

First, we begin by motivating machine learning in general. There 
are many success stories for machine learning in pattern recognition in 
applied settings. In many cases, these applied communities have identi- 
fied various probabilistic models specifically designed and honed to re- 
flect prior knowledge about their domains. Yet these generative models 
must often be discarded when one considers a discriminative approach 
which, ironically, can provide superior performance despite its seemingly
simpler models. A formalism that synergistically combines the
two approaches promises to improve performance even further. It could 
potentially fuse the rich modeling tools and expert domain knowledge 
in the generative learning community with task-oriented learning meth- 
ods in the discriminative learning community. We will discuss in detail 
the various generative and discriminative approaches in the machine 
learning community and identify a road map between the two. An ele- 
gant bridge will then be proposed via maximum entropy discrimination, 
a novel tool with serendipitous flexibilities and generalities. This new 
method is shown to subsume support vector machines while maintaining 
a generative modeling spirit and leads to many interesting extensions. 
The method readily accommodates a large wish-list of diverse learn- 
ing scenarios and addresses many issues which arise in the field such as 
large margin classification, regression, meta-learning, feature selection 
and learning with partially labeled data. We then extend maximum en- 
tropy discrimination to handle latent variables via an iterative algorithm 
providing a crucial aspect of probabilistic models that is often lacking in 
discriminative settings. This allows maximum entropy discrimination to 
span the gamut of generative models including classical distributions in 
the exponential family as well as contemporary mixture models, hidden 
Markov models and Bayesian networks. We flesh out these many aspects 
of maximum entropy discrimination and provide the reader with a foun- 
dation for tackling a wide range of applied problems where the power of 
both generative and discriminative learning needs to be leveraged.

1. Machine Learning Roots 

For motivation, we begin with a quick sampling of the background of
machine learning and its roots in AI and statistics. A reader well-versed
in machine learning, including generative and discriminative approaches, 
may skip this chapter and the next to begin directly with Chapter 3 
where novel approaches are shown to combine both tools. 

Machine learning has enjoyed a diverse history finding its roots in 
many interdisciplinary fields including artificial intelligence, neuroscience, 
cognitive science and various other areas as it eventually connected more 
closely with the field of statistics. As early as 1921, when Capek coined
the term Robot [29], the idea that a machine could be intelligent and po- 
tentially learn from observations began emerging. In 1943, McCulloch, 
a neuroscientist, and Pitts, a logician, proposed a simplified model of 
the neuron as an important atomic computational unit that could per- 
form many boolean functions and which could be combined with other 
neurons in an artificial neural network that could potentially encode 
any logical proposition or program [120]. In 1948, Wiener coined the
term cybernetics. Through his book [196] and several interdisciplinary 
conferences, he discussed the topics of communication and complexly or- 
ganized systems including the human nervous system, society as a whole 
or any other highly organized structure. Simultaneously, Shannon put 
forth a mathematical model of communication which started the field 
of information theory. Shannon proposed that any concept, picture or 
word could be represented, modeled and transmitted with finite sym- 
bols or bits [167]. In 1956, artificial intelligence began its days as a field 
through a first conference held at Dartmouth College. The meeting was 
led by McCarthy, Minsky, Shannon, and Rochester and posed the cen- 
tral AI problem of making a machine that behaved like a human being. 
Conversations ranged over topics like neuron nets, computer language, 
abstraction and creativity. Another crucial component of the discussion 
was artificial or machine learning which eventually grew as a sub-field of 
artificial intelligence. In 1958, Rosenblatt proposed a learning machine 
he called the perceptron [157]. This was a model of the neuron involving 
a weighted sum of inputs followed by a thresholded binary output whose 
weights could be adjusted to learn different tasks. The perceptron under- 
went a setback when, in the late sixties, Minsky and Papert showed that 
it had limitations and could not learn certain nonlinear mappings
[126, 127]. But interest in perceptrons would eventually be rekindled by
Werbos in 1974 [193] when he proposed a back-propagation algorithm for
learning weights in a multi-layer network which could handle nonlinear 
learning. Such multi-layered networks, also called neural networks, went
on to have many successes in application areas and were used to learn
representations, classifiers and regression mappings in many applied domains
[160]. Neural networks also underwent many extensions, including
variants such as recurrent neural networks. They were brought to bear 
on a panorama of applications in science and engineering. While neural 
networks originated in the artificial intelligence field, similar concepts in 
statistics also led to the neural network models and brought additional 
insight and extensions. 

Statistics, like learning in AI, was also concerned with the task of es- 
timating models from data or observations in general. Its foundations 
grew from the early works of many renowned mathematicians. Some 
would go so far as to say Ockham in the 13th and 14th centuries gave 
rise to the early notions of evidence and model selection in statistics. 
Through his efforts to marry Christianity and Aristotelian thought, he
developed arguments for favoring simpler models when all other observed 
evidence was equal. This intuition was later dubbed Ockham’s Razor 
and formalized in learning and statistics problems [151]. Another key 
figure in statistics was Bayes, who, in 1763, brought forth ideas about
probabilities, priors, likelihoods, and posteriors which all interact via the 
now celebrated Bayes rule. Bayesians are statisticians that pay homage 
to Bayes’ work and subscribe to the approach of describing distributions 
over models in addition to data. They also allow the use of probabilities 
as measures of prior beliefs. Thus, Bayesians may sometimes be accused 
of having a somewhat subjective approach to statistics. Generative mod- 
eling is often Bayesian in its thinking and employs Bayes rule extensively. 
Bayesians are distinguished from frequentists who only consider forming 
probabilities from frequencies of observable events and data. Frequen- 
tists refer to probabilities as fractions of a set of observations while 
Bayesians refer to them more subjectively as degrees of belief in an out- 
come [16]. Frequentists avoid priors, using more objective or constant 
approaches (such as minimum variance estimators) to build model esti- 
mates from data. Frequentist approaches gained ground in the early 20th 
century with works by Ronald Fisher who introduced the concept of like- 
lihood and maximum likelihood. However, maximum likelihood and its 
many derivative tools could actually be interpreted under both frequen- 
tist and Bayesian frameworks. In fact, Bayesian approaches regained 
ground in the latter half of the 20th century and currently both schools
coexist in the field (much like generative and discriminative learning co- 
exist today). The maximum likelihood approach is deeply connected to 
the exponential family whose parameter maximization (under mild as- 
sumptions) always leads to a unique solution [146, 104, 9]. Furthermore, 
connections to maximum entropy theory [82] as well as information
theory [35] were developed. The popular expectation maximization algo- 
rithm was also a maximum likelihood framework which was elaborated 
by Dempster et al. [40] to permit estimation of mixture models and 
handling incomplete data and by Baum in his work on hidden Markov 
models [13, 12]. 

In the 1990’s, through important works by Pearl, Lauritzen, Spiegel- 
halter and others [142, 111], Bayesian networks and graphical models 
emerged and were shown to be generalizations of hidden Markov mod- 
els and other structured models like Markov random fields [174, 173]. 
Bayesian networks and statistical graphical models brought forth a pow- 
erful marriage between graph theory and Bayesian statistics. This ex- 
panded the flexibility of Bayesian and generative modeling to highly 
structured domains and allowed the field to accommodate the expert 
prior knowledge and structure in complex applied domains such as speech 
recognition and medical diagnostics. In fact, Bayesian networks have 
also been used to encompass certain aspects of neural networks and con- 
nections emerged via work on so-called sigmoid belief networks [133]. 
Another key development in the 1990’s was the popularization of generalization
bounds on learning machines. This brought both applied
and theoretical interest to classifiers and complexity tools such as the 
Vapnik-Chervonenkis (VC) dimension [187, 186, 188]. VC-dimension
was used to provide generalization guarantees for statistical learning, 
motivating large margin decision boundaries and brought forth a new 
contender in the learning community, the support vector machine. In 
many ways, the support vector machine was reminiscent of the percep- 
tron yet not only found a zero error linear decision boundary for binary 
classification but also ensured that it was the largest margin decision 
boundary. These support vector machines had a decidedly non-Bayesian 
approach yet showed very strong performance on various classification 
tasks. Support vector machines then spread quickly in applied arenas as 
well as motivated development of kernel methods for exploring nonlinear 
decision boundaries for practical problems [165]. Many other tools have
also emerged in machine learning from statistical foundations, including
theoretical advances in learning theory, boosting, decision trees,
bagging, ensemble methods, online learning, and so forth; these can be
reviewed in introductory texts [65].

This tour of machine learning roots leads us to two important contem- 
porary paradigms in machine learning which will be our chief concerns. 
The first is generative or Bayesian learning of probabilistic models in- 
cluding Bayesian networks. The second is discriminative learning of 
classifiers such as support vector machines and kernel methods. We re- 
view some highlights from each before we motivate a potentially very 
interesting merger of the two tools. 

2. Generative Learning 

While science can sometimes provide exact deterministic models of 
phenomena, the mathematical relationships governing more complex 
systems are often only (if at all) partially specifiable. Furthermore, as- 
pects of a hypothesized model may have uncertainty and incomplete 
information. Machine learning and statistics provide a formal approach 
for manipulating nondeterministic models by describing or estimating a 
probability density over the variables in question. Within this genera- 
tive density, one can specify partial knowledge a priori and refine this 
coarse model using empirical observations and data. Therefore, given a 
problem domain with variables $X_1, \ldots, X_T$, a system can be specified
through a joint probability distribution over all the variables within it,
$P(X_1, \ldots, X_T)$. This is known as a generative model since given this
probability distribution, we can artificially generate more samples of 
various configurations of the system. Furthermore, given a full joint 
probability over the space of variables, it is straightforward to condition,
marginalize or mathematically manipulate it to answer many potential 
queries, make inferences and compute predictions. 
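
As a minimal sketch of the operations just described, the following Python fragment stores a joint distribution over three binary variables as a table and then marginalizes it, conditions it, and samples from it. The table entries, the function names, and the choice of three binary variables are assumptions made purely for illustration; they do not come from the text.

```python
import itertools
import random

# Hypothetical joint table P(X1, X2, X3) over three binary variables.
# The weights are arbitrary and are normalized so the entries sum to one.
weights = [3, 1, 2, 2, 1, 4, 1, 2]
states = list(itertools.product([0, 1], repeat=3))
total = sum(weights)
joint = {s: w / total for s, w in zip(states, weights)}


def marginal(joint, keep):
    """Sum out every variable whose index is not listed in `keep`."""
    out = {}
    for state, p in joint.items():
        key = tuple(state[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def condition(joint, evidence):
    """Return P(all variables | evidence), with evidence as {index: value}."""
    kept = {s: p for s, p in joint.items()
            if all(s[i] == v for i, v in evidence.items())}
    z = sum(kept.values())
    return {s: p / z for s, p in kept.items()}


def sample(joint, n, seed=0):
    """Draw n configurations from the joint distribution."""
    rng = random.Random(seed)
    configs = list(joint)
    return rng.choices(configs, weights=[joint[s] for s in configs], k=n)


p_x1 = marginal(joint, keep=[0])                      # P(X1)
p_x3_given = marginal(condition(joint, {0: 1}), [2])  # P(X3 | X1 = 1)
draws = sample(joint, n=5)                            # generated configurations
```

An explicit table of this kind grows exponentially with the number of variables, which is one reason structured representations of the joint distribution are attractive.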

In many domains, greater sophistication and more ambitious tasks 
have made problems so intricate that complete models, theories and 
quantitative approaches are difficult to construct manually. This, combined
with the greater availability of data and computational power, has
encouraged many of these domains to migrate away from rule-based and 
manually specified models to probabilistic data-driven models. However, 
whatever partial domain knowledge is available can
be used to seed a probability model. Developments in machine learn- 
ing and Bayesian statistics have provided a more rigorous formalism 
for representing increasingly complex prior knowledge and combining it 
with observed data. Recent progress in graphical generative models or 
Bayesian networks [142, 111, 92, 97] has permitted prior knowledge to 
be specified structurally by identifying conditional independencies be- 
tween variables. Prior knowledge can also be added parametrically by 
providing prior distributions over variables. This partial domain knowl- 
edge is then combined with observed data resulting in a more precise 
posterior distribution. However, a lingering caveat is that even the par- 
tially specified aspects of a model that have been identified by an expert 
may still contain inaccuracies and may be suspect. Thus, all real models 
are wrong to a certain extent and can benefit from more robust and 
conservative learning frameworks including task-specific discriminative 
learning. 




(a) Mixture Model (b) Directed Graphical Model (c) Undirected Graphical Model 



Figure 1.1. Examples of Generative Models. In (a) we see a probability density 
composed of two Gaussian distributions. In (b), we see a directed graphical model 
depiction of the Quick Medical Reference (QMR) network, a bipartite graph for di- 
agnosing diseases from symptoms. In (c), an undirected graphical model, commonly 
referred to as a Markov random field, is depicted.

In Figure 1.1, we can see a few different examples of generative mod- 
els. A mixture model [16] is shown in Figure 1.1(a) which can also be 
cast as a graphical model. In a mixture model, a parent node selects between
the two possible Gaussian emission distributions. In (b) we note
a slightly more complex generative model with more structure. This 
is a directed bipartite graph relating symptoms to diseases as used in 
the QMR-DT medical diagnostics systems [168, 75]. In (c) a genera- 
tive model is represented as an undirected graphical model commonly 
referred to as a Markov random field which is often used in computer 
vision [51]. Another popular structured generative model is depicted in 
Figure 1.2. This is the dynamic Bayesian network representation of the 
classical hidden Markov model [97, 150, 13]. This directed graph iden- 
tifies conditional independence properties in the hidden Markov model. 
These reflect the so-called Markov property where states only depend on 
their predecessors and outputs only depend on the current state. The 
details and formalism underlying generative models will be presented in 
the next chapter. For now, we provide background motivation through 
examples from multiple applied fields where these probabilistic models 
are becoming increasingly popular. Note that this is just a small collec- 
tion of example models and is by no means a complete survey. 
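
To make the Markov property concrete, here is a small sketch of how one might generate a sequence from a discrete hidden Markov model by ancestral sampling: each state is drawn given only its predecessor, and each output is drawn given only the current state. The function name `sample_hmm` and the two-state, three-symbol parameters are hypothetical and chosen only for illustration.

```python
import random


def sample_hmm(initial, transition, emission, length, seed=0):
    """Ancestral sampling from a discrete hidden Markov model.

    initial[i]       : P(S_1 = i)
    transition[i][j] : P(S_t = j | S_{t-1} = i)
    emission[i][k]   : P(Y_t = k | S_t = i)
    """
    rng = random.Random(seed)
    states, outputs = [], []
    state = rng.choices(range(len(initial)), weights=initial)[0]
    for _ in range(length):
        states.append(state)
        # The output depends only on the current state.
        n_symbols = len(emission[state])
        outputs.append(
            rng.choices(range(n_symbols), weights=emission[state])[0])
        # The next state depends only on the current state.
        n_states = len(transition[state])
        state = rng.choices(range(n_states), weights=transition[state])[0]
    return states, outputs


# Hypothetical two-state, three-symbol model; all numbers are illustrative.
initial = [0.6, 0.4]
transition = [[0.9, 0.1], [0.2, 0.8]]
emission = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
states, outputs = sample_hmm(initial, transition, emission, length=10)
```

In an application only `outputs` would be observed; the sequence in `states` is the latent part of the model.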




Figure 1.2. Generative Hidden Markov Models. The hidden Markov model is a 
distribution depicted using the graphical model above. This highly structured net- 
work indicates conditional independencies between variables and reflects the so-called 
Markov assumption: past states are independent of future states given the present 
state. 



In the applied area of natural language processing, for instance, tra- 
ditional rule-based or boolean logic systems (such as Dialog and Lexis- 
Nexis) are giving way to statistical approaches [32, 26, 117, 91] such as 
Markov models and stochastic context-free grammars. In medical diagnostics,
the Quick Medical Reference knowledge base, initially a heuristic
expert system for reasoning about diseases and symptoms, has been augmented
with a statistical, decision-theoretic formulation [168, 75]. This
new formulation structures a diagnostic problem with a two layer bipar- 
tite graph where diseases are parents of symptoms. Another recent suc- 
cess of generative probabilistic models lies in genomics and bioinformat- 
ics where sequences have been represented as generative hidden Markov 
models [7]. Traditional approaches for modeling genetic regulatory networks
that used boolean approaches or differential equation-based dynamic
models are now being challenged by statistical graphical models
[63]. Here, a model selection criterion identifies the best graph structure 
that matches training data. Another example is in data visualization 
which has also been formalized via graphical models and dependency 
networks to give a principled representation of the data [68, 67]. 

2.1 Generative Models in AI 

In the artificial intelligence (AI) area in general, we see a similar migration
from rule-based expert systems to probabilistic generative models.
For example, in robotics, traditional dynamics and control systems, path 
planning, potential functions and navigation models are now comple- 
mented with probabilistic models for localization, mapping and control 
[99, 180]. Multi-robot control has also been investigated using proba- 
bilistic reinforcement learning approaches [184]. Autonomous agents or 
virtual interactive characters are another example of AI systems. From 
the early days of interaction and gaming, simple rule-based schemes were 
used, such as in Weizenbaum’s Eliza [192] program, where natural language
rules were used to emulate a therapy session. Similarly, graphical
virtual worlds and characters were generated by rules, cognitive models, 
physical simulation, kinematics and dynamics [170, 178]. Statistical ma- 
chine learning techniques are currently being brought to bear in these 
arenas as well [18, 203, 22]. 

2.2 Generative Models in Perception 

In machine perception, generative models and machine learning have 
become prominent tools in particular because of the complexity of the 
domain and its sensors. In speech recognition, hidden Markov models 
[150] are the method of choice due to their probabilistic treatment of 
acoustic coefficients and their Markov assumptions for handling temporal
signals. Even auditory scene analysis and sound texture modeling have
been cast into a probabilistic learning framework, for instance through
tools like independent component analysis, which separates signals by first
fitting a probability model with maximum likelihood [15].

A similar emergence of generative models can also be found in the 
computer vision domain. Techniques such as physics based modeling 
[38], structure from motion and epipolar geometry [49] approaches have 
been complemented with probabilistic models such as Kalman filters [4] 
to prevent instability and provide robustness to sensor noise. Multiple 
hypothesis filtering and tracking in vision have also leveraged proba- 
bilistic models and tools such as Markov chain Monte Carlo and the 
condensation algorithm [74]. More sophisticated probabilistic formulations
in computer vision include the use of Markov random fields and
loopy belief networks to perform super-resolution [51]. Structured latent
mixture models are used to compute transformations and invariants in
face tracking applications [54]. Other distributions, such as Gaussians
over eigenspaces [128] and mixture models [24] have also been used for 
modeling manifolds of many images. Color modeling can be done with 
Gaussian models to permit skin color-based hand tracking and color- 
based object tracking in real-time [164]. Object recognition and feature 
extraction have also benefited greatly from a probabilistic interpretation,
and methods that estimate the statistics of simple features from images 
have shown promising performance in applied recognition settings [163]. 

2.3 Generative Models in Tracking and Dynamics 

Vision also relies on generative models not only to characterize the
static aspects of an image but also to track and model the dynamical
aspects of objects and video. These temporal aspects of vision (and other 
domains) have relied extensively on generative models known as dynamic 
Bayesian networks. Temporal tracking has benefited from probabilistic 
models such as extended Kalman filters [5]. In tracking applications, hid- 
den Markov models are frequently used to recognize gesture [198, 175] 
as well as spatiotemporal activity [55]. The richness of graphical models
permits straightforward combinations of hidden Markov models with
Kalman filters for switching between linear dynamical systems in mod- 
eling gaits [23] or driving maneuvers [144]. Further variations in the 
graphical models include coupled hidden Markov models which are ap- 
propriate for modeling interacting processes such as vehicles or pedes- 
trians in traffic [137, 130, 95]. Bayesian networks have also been used 
in multi-person interaction modeling, for instance in classifying football 
plays [73]. Hidden Markov models have also been evaluated on many 
datasets for general forecasting and time series prediction [56]. 

3. Why a Probability of Everything? 

Is it efficient to create a probability distribution over all variables in 
this generative way? The previous systems make no distinction between 
the roles of different variables and are merely trying to model the whole 
phenomenon. This can be inefficient if we are only trying to learn one 
(or a few) particular tasks that need to be solved and are not interested 
in characterizing the behavior of the complete system. 

An additional caveat is that generative density estimation is formally 
an ill-posed problem. Density estimation, under many circumstances, 
can be a cumbersome intermediate step to what is actually needed from 
the learning machine. For instance, we only need a mapping from input
variables to output variables. Another issue is the difficulty of density 
estimation in terms of sample complexity. A large amount of data may 
be necessary to obtain a good generative model of a system as a whole 
but we may only need a small sample to learn a more focused input- 
output mapping task discriminatively. 

The above AI and perceptual systems often work well because struc- 
tures, priors, representations, invariants and background knowledge were 
designed into the machine by a domain expert. In addition to the learn- 
ing power of the estimation algorithms, one must rely in part on seeding 
the systems with the right structures and priors for successful learning 
to take place. Can we alleviate the amount of manual effort in this pro- 
cess and take some of the human knowledge engineering out of the loop? 
One way is to avoid requiring a highly accurate probability model with 
extensive domain expertise up-front. Discriminative learning algorithms 
for such systems may offer increased robustness to errors in the prior
design process and remain effective for the given task despite incorrect 
modeling assumptions. 

4. Discriminative Learning 

The previous applications we described present compelling evidence 
and strong arguments for using generative models where a joint distri- 
bution is estimated over all variables. Ironically, though, these flexible 
models have recently been outperformed in many cases by relatively
simple models estimated with discriminative algorithms.

As we outlined, probabilistic modeling tools are available for combin- 
ing structure, priors, invariants, latent variables and data to form a good 
joint density of the domain as a whole. However, discriminative algo- 
rithms directly optimize a relatively less domain-specific model only for 
the classification or regression mapping that is required of the machine. 
For example, support vector machines [187, 28] directly maximize the 
margin of a linear separator between two sets of points in a Euclidean 
space to build a binary classifier. While the model is simple (linear), 
the maximum margin criterion is more appropriate than maximum like- 
lihood or other generative criteria for the classification problem. 
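
The margin criterion can be illustrated with a short sketch that measures the geometric margin of a candidate linear separator, that is, the smallest signed distance from any training point to the hyperplane. The toy points, the two candidate separators, and the helper name `geometric_margin` are assumptions for illustration only; the fragment compares two given separators and does not solve the support vector machine optimization itself.

```python
import math


def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    A negative value means at least one point is misclassified.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, y in zip(points, labels))


# Hypothetical toy data: two points per class in the plane.
points = [(2.0, 2.0), (3.0, 1.0), (-2.0, -2.0), (-1.0, -3.0)]
labels = [1, 1, -1, -1]

# Two candidate separators; both classify every point correctly.
margin_a = geometric_margin((1.0, 1.0), 0.0, points, labels)
margin_b = geometric_margin((1.0, 0.0), 0.0, points, labels)
```

Both separators have zero training error, yet `margin_a` is larger than `margin_b`, so a large margin criterion would prefer the first.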

In fact, in domains like image-based digit recognition, support vector 
machines (SVMs) have produced state of the art classification perfor- 
mance [187, 188]. In regression [172] and time series prediction [131], 
SVMs improved upon generative approaches and maximum likelihood. 
In text classification and information retrieval, support vector machines
[45, 154] and transductive support vector machines [94] surpassed the
previously popular naive Bayes and generative text models. In computer 
vision, person detection/recognition [141, 47, 132, 69] have been dominated
by SVMs which surpassed maximum likelihood frameworks. In
some controlled tasks such as gender classification, SVMs can even ap- 
proach human performance levels [129]. In genomics and bioinformatics, 
discriminative systems play a crucial role [195, 201, 77]. Furthermore, 
in speech recognition, discriminative variants of hidden Markov models 
have recently demonstrated superior large-corpus classification perfor- 
mance [152, 153, 200]. Despite the seemingly simpler models used in 
these systems, the discriminative estimation process yields improvements 
over the sophisticated models that have been tailored for the domain in 
generative frameworks. Here, we are using the term sophisticated to 
refer to the extra tailoring that generative models provide the user for 
incorporating knowledge about the domain (in terms of
priors, structures, etc.). Therefore, this is not a claim about the relative
mathematical sophistication of generative and discriminative
models but rather about their ability to handle prior domain knowledge.
Kernel methods do provide some way of incorporating prior knowledge 
into discriminative support vector methods yet are generally not as flex- 
ible as generative modeling. 




(a) Support Vector Machine (b) Mapped Data (c) Separating Hyperplane 

Figure 1.3. Example of a Discriminative Classifier. A support vector machine is depicted
in (a) where a nonlinear decision boundary can be found between two classes of
data. This is done by mapping them into a higher dimensional space where the points
become separable (b) and (c). Here the mapping shown is $(x_1, x_2) \rightarrow (x_1, x_2, x_1 x_2)$,
which is separable by a linear hyperplane.

There are deeply complementary advantages in both the generative 
and discriminative approaches yet, algorithmically, they are not directly 
compatible. Within the community, one could go so far as to say that 
there exist two somewhat disconnected camps that coexist:
generative modeling and discriminative estimation [158, 78, 136]. Prob- 
abilistic models provide the user with the ability to seed the learning 
algorithm with knowledge about the problem at hand. This is given in 
terms of structured models, independence graphs, Markov assumptions, 
prior distributions, latent variables, and probabilistic reasoning [20, 142].
The focus of the probability models is to describe a phenomenon and to
try to resynthesize or generate configurations from it. In the context of 
building classifiers, predictors, regressors and other task-driven systems, 
density estimation over all variables or a full generative description of the 
system can be an inefficient intermediate goal. In general, estimation and 
learning frameworks for probabilistic generative models have not been 
able to directly optimize parameters for a specific task. These models 
are therefore marred by generic optimization criteria such as maximum 
likelihood which are oblivious to the particular classification, prediction, 
or regression mapping the machine must perform. Meanwhile, discrimi- 
native techniques such as support vector machines offer less in terms of 
structure and prior modeling power yet achieve better performance on 
many datasets. This is due to their inherent and direct optimization of a 
task-related criterion. For example, Figure 1.3(a) shows a support vector 
machine binary classifier. It uses a highly appropriate criterion for learn- 
ing: find the largest margin separation boundary which we shall discuss 
in Chapter 2. This is formed by mapping data to a higher dimensional 
space via the so-called kernel method and computing the largest margin 
separating hyperplane therein as depicted in Figure 1.3(b) and (c). The 
focus of the SVM is on classification as opposed to generation which 
properly allocates computational resources directly to the required task. 

Nevertheless, as previously mentioned, there are some fundamental 
differences in the two approaches making it awkward to combine their 
strengths in a principled way. It would be of considerable value to pro- 
pose an elegant framework which could subsume and unite both schools 
of thought. 

5. Objective 

Therefore, we pose the following main challenge: find a combined
discriminative and generative framework which extends the powerful
generative models that are popular in the machine learning community
into discriminative frameworks such as those present in support vector
machines. This framework should take us from generative models and
Bayesian learning to support vector machines and back. Ideally, in this 
framework, all parameters and aspects of the generative model will be
estimated according to the same discriminative large-margin criteria that
support vector machines enjoy with their optimal hyperplane decision 
boundaries. Furthermore, the framework should give rise to many of 
the generalization, convexity and sparsity properties of the SVM while 
estimating parameters for a wide range of interesting probability mod- 
els, distributions and Bayesian networks, in the field. We enumerate 
additional desiderata as follows:

■ Combined Generative and Discriminative Learning

We will provide a discriminative large-margin classification frame- 
work that applies to the many Bayesian generative probability models 
spanning many contemporary distributions and subsuming support 
vector machines. Furthermore, via an appeal to maximum entropy, 
this framework has connections to maximum likelihood and Bayesian 
approaches. The formalism also has connections to regularization 
theory and support vector machines, two important and principled 
approaches in the discriminative school of thought. 

■ Applicability to a Spectrum of Bayesian Generative Models 

To span a wide variety of generative models, we will focus on the
exponential family which is central to much of statistics and maximum
likelihood estimation. The discriminative methods will be consis- 
tently applicable to this large family of distributions. 

■ Ability to Handle Latent Variables 

The strength of many generative models lies in their ability to
handle latent variables and mixture models. We will ensure that the 
discriminative method can also span these higher order multimodal 
distributions. Through bounding methods similar to the expectation- 
maximization approach, we will derive iterative algorithms that grad- 
ually improve the margin and discrimination of a mixture model and 
other structured latent models. 

■ Computational Efficiency 

Throughout the development of the discriminative and generative 
learning procedures, we will consistently discuss issues of compu- 
tational efficiency and implementation. These frameworks will be 
shown to be viable in large data scenarios and computationally as 
tractable as their traditional maximum likelihood or support vector 
machine counterparts. 

■ Formal Generalization Guarantees 

While empirical validation can motivate the combined generative- 
discriminative framework, we will also identify formal generalization 
guarantees from different perspectives. Various arguments from the 
literature such as sparsity, VC-dimension and PAC-Bayes generaliza- 
tion bounds will be compatible with this new framework. 

■ Extensibility 

Many extensions will be demonstrated in the hybrid generative dis- 
criminative approach which will justify its usefulness. These in- 
clude the ability to handle regression, multi-class classification, trans- 
duction, feature selection, kernel selection, meta-learning, structure
learning, exponential family models and mixtures of the exponential
family. 

6. Scope and Organization 

This text focuses on computational and statistical aspects of machine 
learning and assumes a certain level of prior knowledge about generative 
models and support vector machines. The following are good starting 
points for a reader interested in learning more about generative models: 
[96, 97, 142, 111, 92, 135, 44]. Conversely, the discriminative learning 
approaches are discussed in detail in the following texts and articles 
[165, 37, 70, 28, 187, 65]. We do cover some of this background material 
in Chapter 2. However, this is done in less detail so that we can focus
on the hybrid framework that fuses discrimination
with generative models.

The rest of this monograph is organized as follows: 

■ Chapter 2 

Background is given on standard machine learning methods, intro- 
ducing the topic in general. This background covers probability 
distributions and generative models such as the exponential family, 
Gaussians and multinomial distributions. It also elaborates Bayesian 
inference and maximum likelihood estimation. More advanced topics 
are then covered including expectation-maximization, Bayesian net- 
works, maximum entropy and support vector machines. The com- 
plementary advantages of discriminative and generative learning are 
discussed. We formalize the many models and methods of inference 
in generative, conditional and discriminative learning and note a road 
map that connects the various approaches. The advantages and dis- 
advantages of each are enumerated and motivation for methods for 
fusing them is given. 

■ Chapter 3 

The maximum entropy discrimination (MED) formalism is intro- 
duced as the method of choice for combining generative models in 
a discriminative estimation setting. The formalism is presented as 
an extension to regularization theory and shown to subsume support 
vector machines. A discussion of margins, bias and model priors is 
presented. The MED framework is then extended to handle gener- 
ative models in the exponential family. Comparisons are made with 
state of the art support vector machines and other learning algo- 
rithms. Generalization guarantees on MED are then provided by 
appealing to recent results in the literature.

■ Chapter 4

Various extensions to the maximum entropy discrimination formalism 
are proposed and elaborated. These include multi-class classification 
and regression which follow through naturally from the initial binary 
classification problem the MED framework originated in. Further- 
more, extensions of classification and regression for simultaneously 
performing feature selection and kernel selection are discussed. These 
help improve discrimination power by adjusting the model’s internal 
representation. Meta-learning or learning with multiple inter-related 
classification and regression tasks is also shown with the MED frame- 
work. Transduction and learning from unlabeled data is then elab- 
orated in both regression and classification settings. Comparisons 
are made with state of the art support vector machines and other 
learning algorithms. 

■ Chapter 5 

Latent learning is motivated in a discriminative setting by bounding 
constraints in the MED framework and solving it iteratively. This 
mirrors the traditional expectation maximization framework yet en- 
sures that mixture models and other latent models are estimated dis- 
criminatively to form optimal classifiers. Comparisons are made with 
expectation-maximization approaches which do not optimize
discrimination power. We also consider the case of discriminative learning of
structured mixture models where the mixture is not flat but has some 
additional structures that generate an intractable number of latent 
configurations. Various properties are noted in the iterative latent 
MED framework that permit it to handle these large latent configu- 
rations in an efficient manner. This permits latent discrimination to 
elegantly extend to hidden Markov models and other elaborate latent 
Bayesian network models while remaining practical on real problems. 

■ Chapter 6 

The advantage of a joint framework for generative and discriminative 
learning is reiterated. The various lessons of the text are summarized 
and future extensions, elaborations, and challenges are discussed. 

■ Chapter 7 

This appendix gives some implementation details for the required 
optimization algorithms. 

7. Online Support 

This monograph is complemented by various online materials to help
the student, instructor and practitioner of machine learning in obtaining
the most from the methods we will discuss. These online materials are
provided on the web via the following home page:

http://www.cs.columbia.edu/~jebara/ml

Among other things, the additional online material includes: 

■ Corrections to this text and various errata post-publication. 

■ Course notes and materials for lectures and background information. 

■ Source code for various algorithms and methods in the text in either
C or Matlab form including the following tools:

— Support Vector Machine (SVM) Classification 

— SVM Regression 

— SVM Classification with Feature Selection 

— SVM Regression with Feature Selection 

— SVM Regression with Unlabeled Data 

— SVM Classification with Kernel Selection 

— Multi-Task SVM Kernel Selection 

— Multi-Task SVM Feature Selection 

— Large Margin Variable Covariance Gaussian Models 

— Large Margin Mixture Models 

— Large Margin Hidden Markov Models 




Chapter 2 



GENERATIVE VERSUS DISCRIMINATIVE 
LEARNING 



All models are wrong, but some are useful 
George Box, 1979 



In this chapter, we review discriminative and generative learning more 
formally. This includes a discussion of their underlying estimation algo- 
rithms and the criteria they optimize. A natural intermediate between 
the two is conditional learning which helps us visualize the coarse con- 
tinuum between these two extremes. Figure 2.1 depicts the panorama of 
approaches as we go horizontally from the generative criteria to discrim- 
inative criteria. Similarly, on the vertical scale of variation, we see the 
estimation procedures range from local or direct solutions optimized on 
training data alone, to regularized solutions that use training data and 
priors, to fully averaged solutions which attempt to reduce over-fitting to 
the training data by considering a full distribution on potential solutions. 

In this chapter we begin with a sample of generative and discrimina- 
tive techniques and then explore the entries in Figure 2.1 in more detail. 
The figure shows a spectrum of generative, conditional and discrimina- 
tive learning as well as local, regularized and averaged solution methods. 
Each box in the table shows a particular framework which captures the 
given combination of discrimination and integration. Methods which are 
underlined are flexible enough and have been used with many graphical 
models, including exponential family distributions and latent Bayesian 
networks and will be emphasized in this text. 

Figure 2.1. Scales of Discrimination and Integration in Learning.

The generative models at one extreme attempt to estimate a distribu-
tion over all variables (inputs and outputs) in a system. This is inefficient
if we only need conditional distributions of output given input to perform
classification or prediction. Thus, we can motivate a more minimalist 
approach: conditional modeling. However, in many practical systems, 
we can be even more minimalist since we only need a single estimate from 
a conditional distribution making conditional modeling seem inefficient 
as well. This motivates discriminative learning and SVMs which only 
consider the input-output mapping required for the task at hand. We 
then conclude with some hybrid frameworks for combining generative 
and discriminative models and point out their limitations. 

The many tools in this chapter including Bayesian methods, maximum 
likelihood, maximum entropy, exponential families, expectation maxi- 
mization, graphical models, junction trees, support vector machines and 
kernel methods will actually be reused in subsequent chapters to build 
a hybrid generative and discriminative framework. Therefore, we show 
some of these tools in detail since we will later call upon them. 

1. Two Schools of Thought 

We now show samples of machine learning methods from what could 
be called two schools of thought: discriminative and generative
approaches. Alternatively, the two competing formalisms have also been
labeled discriminative versus informative approaches [158, 179]. Gen- 
erative or informative approaches produce a probability density model 
over all variables in a system and manipulate it to compute classifi- 
cation and regression functions. Discriminative approaches provide a 
direct attempt to compute the input-output mappings for classification 
and regression. They eschew the direct modeling of the underlying
distributions. While the holistic picture of generative models is appealing
for its completeness, it can be wasteful and non-robust. Furthermore, as 
Box points out, all models are wrong (but some are useful). Therefore, 
the graphical models and the prior structures that we will enforce on our 
generative models may have some useful elements yet should always be 
treated cautiously since in real-world problems, the true distributions al- 
most never coincide with the ones we have constructed. In fact, Bayesian 
inference does not guarantee that we will obtain the correct posterior es- 
timate if the class of distributions we consider does not contain the true 
generator of the data we are observing. Here we show examples of the 
two schools of thought and then elaborate on the learning criteria they 
use in the next sections. 

1.1 Generative Probabilistic Models 

In generative or Bayesian probabilistic models, a system’s input (co- 
variate) features and output (response) variables and possibly latent 
variables are represented homogeneously by a joint probability distribu- 
tion (which may potentially have a graphical structure). These variables 
can be discrete or continuous and may also be multidimensional. Since 
generative models define a distribution over all variables, they can also be 
used for classification and regression [158] by standard marginalization 
and conditioning operations. Generative models or probability densities 
typically span the class of exponential family distributions and mixtures 
of the exponential family. More specifically, popular models in various 
domains include Gaussians, naive Bayes, mixtures of multinomials, mix- 
tures of Gaussians [16], mixtures of experts [98], hidden Markov models 
[150], sigmoidal belief networks, Bayesian networks [92, 111, 142], and 
Markov random fields [202]. 

For $n$ variables of the form $(X_1, \ldots, X_n)$, we therefore have a full
joint distribution of the form $P(X_1, \ldots, X_n)$. Strictly for presentation
purposes, throughout this text we use the capitalized $P$ notation to
denote a probability distribution instead of using lower case $p$. Given an
accurate joint distribution that captures the possibly nondeterministic 
relationships between the variables, it is then straightforward to perform 
inference and answer queries. This is done by standard manipulations 
based on the basic axioms of probability theory such as marginalizing, 
conditioning and using Bayes’ rule: 

$$P(X_j) = \sum_{X_i,\; i \neq j} P(X_1, \ldots, X_n)$$

$$P(X_j \mid X_k) = \frac{P(X_j, X_k)}{P(X_k)} = \frac{P(X_k \mid X_j)\, P(X_j)}{P(X_k)}$$

By conditioning a joint distribution, we can easily form classifiers,
regressors and predictors in a straightforward manner which map input
variables to output variables. For instance, we may want to obtain
an estimate $\hat{X}_j$ of the output (which may be a discrete class label or
a continuous regression output) from the input $X_k$ using the conditional
distribution $P(X_j \mid X_k)$. While a purist Bayesian would argue
that the only appropriate answer is the conditional distribution itself
$P(X_j \mid X_k)$, in practice we must settle for an approximation to obtain an
$\hat{X}_j$. For example, we may randomly sample from $P(X_j \mid X_k)$, compute
the expectation of $P(X_j \mid X_k)$ or find the mode(s) of the distribution, i.e.
$\arg\max_{X_j} P(X_j \mid X_k)$.
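
As a small illustration of these manipulations, the sketch below marginalizes and conditions a discrete joint distribution over an output Xj and an input Xk, and then extracts a point estimate by the mode, the expectation, or sampling. The joint table and the helper names are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import random

# Hypothetical joint distribution P(Xj, Xk): Xj is a label in {0, 1, 2} and
# Xk is a binary input. Keys are (xj, xk) and the entries sum to one.
joint = {
    (0, 0): 0.10, (0, 1): 0.20,
    (1, 0): 0.30, (1, 1): 0.05,
    (2, 0): 0.10, (2, 1): 0.25,
}


def marginal_j(xj):
    """P(Xj = xj), obtained by summing the joint over Xk."""
    return sum(p for (j, _), p in joint.items() if j == xj)


def marginal_k(xk):
    """P(Xk = xk), obtained by summing the joint over Xj."""
    return sum(p for (_, k), p in joint.items() if k == xk)


def conditional_j_given_k(xk):
    """P(Xj | Xk = xk) as a dictionary over the values of Xj."""
    pk = marginal_k(xk)
    return {j: joint[(j, xk)] / pk for j in (0, 1, 2)}


cond = conditional_j_given_k(1)

# Three ways of reducing the conditional distribution to a single estimate.
mode = max(cond, key=cond.get)
expectation = sum(j * p for j, p in cond.items())
sample = random.Random(0).choices(list(cond), weights=list(cond.values()))[0]

# Bayes' rule recovers the same conditional from P(Xk | Xj), P(Xj) and P(Xk).
p_k_given_j = joint[(2, 1)] / marginal_j(2)
bayes = p_k_given_j * marginal_j(2) / marginal_k(1)
print(mode, expectation, abs(bayes - cond[2]) < 1e-12)
```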




Figure 2.2. Directed Graphical Models. 

There are many ways to constrain this joint distribution such that it 
has fewer degrees of freedom before we directly estimate it from data. 
One way is to structurally identify conditional independencies between 
variables. This is depicted, for example, with the directed graph or 
Bayesian network in Figure 2.2. Here, the graph shows that the joint 
distribution factorizes into a product of conditional distributions over 
the variables given their parents (here $\pi_i$ are the parents of the variable $X_i$
or node labeled $i$):

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_{\pi_i}).$$
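
The reduction in degrees of freedom can be seen in a small sketch. We assume a hypothetical three-node chain X1 -> X2 -> X3 over binary variables (this is not the graph of Figure 2.2) with made-up conditional probability tables. The factorized model needs 1 + 2 + 2 = 5 free parameters, whereas an unconstrained joint table over three binary variables needs 2^3 - 1 = 7.

```python
from itertools import product

# Hypothetical conditional probability tables for the chain X1 -> X2 -> X3.
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}  # indexed [x1][x2]
p_x3_given_x2 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.5, 1: 0.5}}  # indexed [x2][x3]


def joint(x1, x2, x3):
    """Joint probability as the product of each node's conditional given its parent."""
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]


# The product of conditionals defines a proper joint distribution.
total = sum(joint(*values) for values in product((0, 1), repeat=3))

# Any query follows by summing the joint, e.g. the marginal P(X3 = 1).
p_x3_is_1 = sum(joint(x1, x2, 1) for x1, x2 in product((0, 1), repeat=2))
print(total, p_x3_is_1)
```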

Alternatively, we can parametrically constrain the distribution by giving 
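prior distributions over the variables and the hyper-variables that affect
them. For example, we may restrict two variables $(X_i, X_k)$ to be jointly
a mixture of Gaussians with unknown means and a covariance equal to
identity:

$$P(X_i, X_k) = \alpha\, \mathcal{N}\!\left( \begin{bmatrix} X_i \\ X_k \end{bmatrix} \,\middle|\, \mu_1, I \right) + (1 - \alpha)\, \mathcal{N}\!\left( \begin{bmatrix} X_i \\ X_k \end{bmatrix} \,\middle|\, \mu_2, I \right)$$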

Other types of restrictions exist, for instance, those related to sufficiency 
and separability [145] where a conditional distribution might simplify
according to a mixture of simpler conditionals as in:

$$P(X_i \mid X_j, X_k) = \alpha P(X_i \mid X_j) + (1 - \alpha) P(X_i \mid X_k).$$

Various flexibilities arise when we work with joint distributions since we 
can insert knowledge about the relationships between variables, invari- 
ants, independencies, prior distributions and so forth. This includes all 
variables in the system, unobserved, observed, input or output variables. 
This makes generative probability distributions a very flexible modeling 
tool. 

Unfortunately, the learning algorithms used to combine such models 
with observed data to produce the final posterior distribution can some- 
times be inefficient. Finding the ideal generator of the data (combined 
with prior knowledge) is only an intermediate goal in many settings. 
In practical applications, we wish to use these generators for the ulti- 
mate tasks of classification, prediction and regression. In optimizing for 
an intermediate generative goal, we may sacrifice resources and reduce 
potential performance on these final discriminative tasks. In section 2 
we review standard techniques for learning from data in generative ap- 
proaches and outline their lack of discriminative machinery. 

1.2 Discriminative Classifiers and Regressors 

Discriminative approaches make no explicit attempt to model the un- 
derlying distributions of the variables and features in a system and are 
only interested in optimizing a mapping from the inputs to the desired 
outputs (say a discrete class or a scalar prediction) [158]. Only the re- 
sulting classification boundary (or function approximation accuracy for 
regression) is adjusted. They thus eschew the intermediate goal of
forming a generator that can model all variables in the system. This
focuses model and computational resources on the given task and often
provides better performance. Popular and successful examples include logistic
regression [72, 64], Gaussian processes [61], regularization networks [62],
support vector machines [187], and traditional neural networks [16]. 

Robust (discriminative) classification and regression methods have 
been successful in many areas ranging from image and document classi- 
fication [94, 154] to problems in biosequence analysis [76, 195] and time 
series prediction [131]. Techniques such as support vector machines [188],
Gaussian process models [197], boosting algorithms [52, 53], and more 
standard but related statistical methods such as logistic regression, are 
all robust against errors in structural assumptions. This property arises 
from a precise match between the training objective and the criterion by 
which the methods are subsequently evaluated. There is no surrogate 
intermediate goal to obtain a good generative model. 







However, the discriminative algorithms do not extend well to clas- 
sifiers and regressors arising from generative models and the resulting 
parameter estimation is difficult [158]. The models that discriminative 
methods do use (be they parametric or otherwise) often lack the elegant 
probabilistic concepts of priors, structure, and uncertainty that are so 
beneficial in generative settings. Instead, alternative notions of penalty 
functions, regularization, and kernels are used. Furthermore, learning 
classifiers and mappings is the focus of discriminative approaches, which
makes it hard to insert flexible modeling tools and generative prior
knowledge about the space of all variables. Thus, discriminative techniques
may feel like black boxes where the relationships between variables are
not as explicit or visualizable as in generative models.

Furthermore, discriminative approaches may be inefficient to train 
since they require simultaneous consideration of all data from all classes. 
Another inefficiency arises in discriminative techniques since each task a 
discriminative inference engine needs to solve requires a different model 
and a new training session. Various methods exist to alleviate this ex- 
tra work arising in discriminative learning. These include online learn- 
ing which can be easily applied to, for example, boosting procedures 
[140, 50, 53]. Moreover, it is not always necessary to construct all possi- 
ble discriminative mappings in a system of variables which would require 
an exponential number of models [59]. Frequent tasks, i.e. canonical
classification and regression objectives, can be targeted with a handful of
discriminative models while a generative model can be kept around for 
handling occasional missing labels or rare types of inference. In section 4 
we discuss techniques for learning from data in discriminative
approaches. 

2. Generative Learning 

There are many variations for learning generative models from data. 
These many approaches, priors and model selection criteria include min- 
imum description length, Bayesian information criterion, Akaike infor- 
mation criterion, and entropic priors [156, 42, 21] and a full survey is 
beyond the scope of this text. We will instead quickly discuss the popular 
classical approaches that include Bayesian inference, maximum a pos- 
teriori, and maximum likelihood. These can be seen as ranging from a 
scale of a fully weighted averaging over all generative model hypotheses 
(Bayesian inference), to more local computation with simple regular- 
ization and priors (maximum a posteriori) to the maximum likelihood 
estimator which only considers performance on the given training data. 
We also review maximum entropy, latent maximum likelihood (via the 
expectation-maximization algorithm) and graphical models.

2.1 Bayesian Inference

In Bayesian inference [16, 115, 20, 124, 17, 158], the probability
density function of a vector $Z$ is typically estimated from a training set of such
vectors $\mathcal{Z} = \{Z_1, \ldots, Z_T\}$. More generally, $Z$ need not be a vector but
could correspond to multiple observable variables that are either
continuous or discrete. We will assume that the $Z_t$ are vectors without loss of
generality. The joint Bayesian inference estimation process is shown in
Equation 2.1.

$$P(Z \mid \mathcal{Z}) = \int P(Z, \Theta \mid \mathcal{Z})\, d\Theta = \int P(Z \mid \Theta, \mathcal{Z})\, P(\Theta \mid \mathcal{Z})\, d\Theta \qquad (2.1)$$

By integrating over $\Theta$, we are essentially integrating over all possible
probability density functions (pdfs). This involves varying the families
of pdfs and all their parameters. However, this is often impossible and
instead a sub-family is selected and only its specific parameterization $\Theta$
is varied in the integral. Each $\Theta$ is a parameterization of a pdf over $Z$
and is weighted by its likelihood given the training set. Having obtained
a $P(Z \mid \mathcal{Z})$ or, more compactly, a $P(Z)$, we can compute the probability
of any point $Z$ in the (continuous or discrete) probability space. We
can also simply compute the scalar quantity $P(\mathcal{Z})$ by using the above
approaches to integrate the likelihood of all observations over all models
$\Theta$ under a prior $P(\Theta)$. This quantity is called the evidence and higher
values indicate how appropriate this choice of parametric models $\Theta$ and
prior $P(\Theta)$ was for this particular dataset. Bayesian evidence can then
be used for model selection by finding which choice of parametric models
or structures (once integrated over all its parameter settings) yields the
highest $P(\mathcal{Z})$ [90, 151].
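
As a hedged illustration of evidence-based model selection, the sketch below uses a Bernoulli likelihood with a Beta prior, a conjugate pair for which the integral over the parameter has a closed form. The coin-flip data and the two candidate priors are assumptions made only for this example.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_evidence(heads, tails, a, b):
    """Log evidence of one observed sequence of coin flips.

    The Bernoulli likelihood theta^heads * (1 - theta)^tails is integrated
    against a Beta(a, b) prior, giving B(a + heads, b + tails) / B(a, b).
    """
    return log_beta(a + heads, b + tails) - log_beta(a, b)


# Hypothetical dataset: nine heads and one tail.
heads, tails = 9, 1

# Two candidate priors: one concentrated near a fair coin, one uniform.
fair_prior = log_evidence(heads, tails, 50.0, 50.0)
flat_prior = log_evidence(heads, tails, 1.0, 1.0)

# The prior with the higher evidence is the better match for this dataset.
print(exp(fair_prior), exp(flat_prior))
```

Here the uniform prior attains the higher evidence because the strongly "fair" prior places little mass on parameter values that explain nine heads in ten flips.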

However, evaluating the pdf $P(Z)$ or the evidence $P(\mathcal{Z})$ may not
necessarily be our ultimate objective. Often, some components of the
vector are given as inputs, denoted by $X$, and the learning system is
required to estimate the missing components as output, $Y$. We are then
effectively using our Bayesian tools to learn a mapping from the input
to the output. In other words, $Z$ can be broken up into two sub-vectors
$X$ and $Y$ and a conditional pdf is computed from the original joint pdf
over the whole vector as in Equation 2.2.

$$P(Y \mid X)^j = \frac{P(Z)}{\int P(Z)\, dY} = \frac{P(X, Y)}{P(X)} = \frac{\int P(X, Y \mid \Theta)\, P(\Theta \mid \mathcal{X}, \mathcal{Y})\, d\Theta}{\int P(X \mid \Theta)\, P(\Theta \mid \mathcal{X}, \mathcal{Y})\, d\Theta} \qquad (2.2)$$

Note that we will assume we have a set of inputs $\mathcal{X} = \{X_1, \ldots, X_T\}$ and
their corresponding outputs $\mathcal{Y} = \{Y_1, \ldots, Y_T\}$. The resulting conditional
pdf is then $P(Y \mid X)^j$. We use the $j$ superscript to indicate that it is
obtained from a previous estimate of the joint density. When a new
input $X$ is specified, this conditional density becomes a density over $Y$,
the desired output of the system. The $Y$ element may be a continuous
vector, a discrete value or some other sample from the probability space
$P(Y)$. If this density is the required function of the learning system
and if a final output estimate $\hat{Y}$ is needed, the expectation or arg max of
$P(Y \mid X)$ is used.
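
The two-step procedure (estimate a joint density, then condition it) can be sketched for scalar X and Y under a joint Gaussian assumption. Note that this sketch substitutes a plug-in maximum likelihood estimate of the joint for the full Bayesian integral over the parameters, and that the data and function names are hypothetical.

```python
def fit_joint_gaussian(xs, ys):
    """Maximum likelihood means and (co)variances of a 2-D Gaussian over (X, Y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return mean_x, mean_y, var_x, var_y, cov_xy


def conditional_mean(x, params):
    """Expectation of Y given X = x under the fitted joint Gaussian."""
    mean_x, mean_y, var_x, _, cov_xy = params
    return mean_y + cov_xy / var_x * (x - mean_x)


def conditional_variance(params):
    """Variance of Y given X under the fitted joint Gaussian."""
    _, _, var_x, var_y, cov_xy = params
    return var_y - cov_xy ** 2 / var_x


# Hypothetical noise-free training pairs lying on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
params = fit_joint_gaussian(xs, ys)

# For a Gaussian the expectation and the arg max of P(Y | X) coincide.
y_hat = conditional_mean(4.0, params)
print(y_hat, conditional_variance(params))
```

The parameters are fit to model X and Y jointly; the regression from X to Y is only a by-product of that fit, which is the point the superscript in Equation 2.2 is meant to flag.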

In the above derivation, we have deliberately expanded the Bayesian 
integral to emphasize the generative learning step. This is to permit 
us to differentiate the above joint Bayesian inference technique from its 
conditional counterpart, conditional Bayesian inference, which we will 
discuss later on. 

2.2 Maximum Likelihood 

Traditionally, computing the integral in Equation 2.1 is not always 
straightforward and Bayesian inference is often approximated via max- 
imum a posteriori (MAP) or maximum likelihood (ML) estimation [43, 
122] as shown below. 

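$$P(Z \mid \mathcal{Z}) \approx P(Z \mid \Theta^*, \mathcal{Z})$$

$$\text{where } \Theta^* = \begin{cases} \arg\max_{\Theta} P(\Theta \mid \mathcal{Z}) = \arg\max_{\Theta} P(\mathcal{Z} \mid \Theta)\, P(\Theta) & \text{MAP} \\ \arg\max_{\Theta} P(\mathcal{Z} \mid \Theta) & \text{ML.} \end{cases}$$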

Under iid (independent identically distributed data) conditions, it is eas- 
ier to instead find the maximum of the logarithm of the above quantities. 
This still yields the same arg max and the same model. We expand the 
above for the maximum log-likelihood case as follows:

$$l(\Theta) = \log P(\mathcal{Z} \mid \Theta) = \sum_{t} \log P(Z_t \mid \Theta).$$

This optimization of joint log likelihood under iid conditions is additive 
in the sense that each data point contributes to the objective function 
additively. This facilitates optimization for exponential family distribu- 
tions as we shall see shortly. 

Maximum likelihood and maximum a posteriori can be seen as ap- 
proximations to Bayesian inference where the integral over a distribution 
of models is replaced with the mode. One should note, however, that 
maximum likelihood was derived by Fisher and did not originate as an 
approximation of the Bayesian inference approach above. The a posteri- 
ori solution allows the use of a prior to regularize the estimate while the 
maximum likelihood approach merely optimizes the model on the training
data alone which may cause overfitting. MAP also permits the user
to insert prior knowledge about the parameters of the model and bias it
towards solutions that are more likely to generalize well. For example, 
one may consider priors that favor simpler models to avoid overfitting 
or to sparsify a model to keep it compact [182, 21]. Meanwhile, the 
maximum likelihood criterion only considers training examples and op- 
timizes the model specifically for them. It is equivalent to MAP when 
we assume a uniform prior. Unlike MAP and Bayesian inference, ML 
may show poor generalization when limited samples are available. This 
makes ML a more local solution than MAP since it is tuned only to the
training data while MAP is tuned to the data as well as the prior. MAP 
thus approximates the weighted averaging of all models in the Bayesian 
inference solution more closely. 
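
The contrast between the two estimators is easy to see on a tiny sample. The following sketch assumes a Bernoulli model with a Beta(a, b) prior, for which both estimates are available in closed form; the three-flip dataset and the prior settings are illustrative assumptions.

```python
def ml_estimate(heads, n):
    """Maximum likelihood estimate of the Bernoulli parameter."""
    return heads / n


def map_estimate(heads, n, a, b):
    """Mode of the Beta(a + heads, b + n - heads) posterior.

    Assumes a >= 1 and b >= 1 with n + a + b > 2 so that the mode is well defined.
    """
    return (heads + a - 1.0) / (n + a + b - 2.0)


# Hypothetical limited sample: three flips, all heads.
heads, n = 3, 3

ml = ml_estimate(heads, n)                        # tuned to the data alone
map_mild = map_estimate(heads, n, 2.0, 2.0)       # prior pulls toward 0.5
map_uniform = map_estimate(heads, n, 1.0, 1.0)    # uniform prior
print(ml, map_mild, map_uniform)
```

With only three observations, ML commits to a probability of one for heads, while the mild prior keeps the MAP estimate away from that extreme. Under the uniform Beta(1, 1) prior the two estimates coincide, as noted above.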

2.3 The Exponential Family 

The discussion of maximum likelihood naturally leads us to the
exponential family (or e-family for short) of distributions [11, 9, 25, 146,
104, 27, 6]. This is the set of parametric distributions that has so-called
sufficient statistics and a consistent maximum likelihood estimate. In
other words, the likelihood cost function has a global optimum and the 
estimate of the parameters of distributions in this family will be unique. 

The e-family (which is closely related to generalized linear models) 
has the following form: 

$$P(X \mid \Theta) = \exp\left( \mathcal{A}(X) + T(X)^T \Theta - \mathcal{K}(\Theta) \right).$$

Here, the e-family is shown in its natural parameterization. This form
restricts the distribution to be the exponential of a function of the data
$A(X)$, a function of the model $\mathcal{K}(\Theta)$ and an inner product
between the model and a function of the data $T(X)^T \Theta$. Many
alternative parameterizations exist; however, this natural parameterization
will be easiest to manipulate for our purposes. The $\mathcal{K}(\Theta)$
function is called the partition function and it ensures the distribution is
normalized when we integrate over $X$. It is a convex function in $\Theta$,
the multi-dimensional parameter vector. More specifically,
$\mathcal{K}(\Theta)$ is not just any convex function but is also given by
the following Laplace transform. This follows directly from the
normalization property of the distribution, which in turn yields the
convexity of $\mathcal{K}(\Theta)$:

$$\mathcal{K}(\Theta) = \log \int_X \exp\left(A(X) + T(X)^T \Theta\right) dX.$$

The function $T(X)$ gives the so-called sufficient statistic because, as we
shall see shortly, the average of $T(X)$ across a dataset is all that is needed
to recover the parameters with maximum likelihood. The function $A(X)$
is called the measure because it affects how we compute the integrals over 
the sample space where our data points reside. The partition function 
is also sometimes called the cumulant generating function because its 
derivative and higher order derivatives generate the cumulants (which 
are themselves functions of the moments or sufficient statistics) of the 
distribution. The mean and the covariance are the first and second 
cumulants, respectively. 



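$$\frac{\partial \mathcal{K}(\Theta)}{\partial \Theta} = E_{P(X|\Theta)}\{T(X)\}$$

$$\frac{\partial^2 \mathcal{K}(\Theta)}{\partial \Theta \, \partial \Theta^T} = E_P\{T(X) T(X)^T\} - E_P\{T(X)\} E_P\{T(X)^T\}.$$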
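As a quick numerical illustration of the cumulant-generating property, the following Python sketch takes the Poisson case, assuming $A(x) = -\log(x!)$ (the sign needed for the mass function to normalize), $T(x) = x$ and a log-partition function equal to $\exp(\Theta)$. It evaluates the log-partition function by direct summation and compares its finite-difference derivatives with the mean and variance. The function names, the truncation at 200 counts and the step size `h` are illustrative choices, not part of the original text.

```python
import math

def log_partition(theta, max_count=200):
    # K(theta) = log sum_x exp(A(x) + T(x) * theta), with A(x) = -log(x!), T(x) = x.
    # The infinite sum is truncated at max_count (an assumption of this sketch).
    terms = [x * theta - math.lgamma(x + 1) for x in range(max_count)]
    peak = max(terms)  # subtract the peak for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def moments(theta, max_count=200):
    # Mean and variance of T(x) = x computed directly from the distribution.
    k = log_partition(theta, max_count)
    probs = [math.exp(x * theta - math.lgamma(x + 1) - k) for x in range(max_count)]
    mean = sum(x * p for x, p in enumerate(probs))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
    return mean, var

theta = math.log(3.0)  # natural parameter of a Poisson with rate 3
h = 1e-3               # finite-difference step
first = (log_partition(theta + h) - log_partition(theta - h)) / (2 * h)
second = (log_partition(theta + h) - 2 * log_partition(theta)
          + log_partition(theta - h)) / h ** 2
mean, var = moments(theta)
print(first, mean)   # first derivative of K vs. mean
print(second, var)   # second derivative of K vs. variance
```

Both pairs of numbers should agree up to finite-difference error, which is the statement that the first two derivatives of the log-partition function are the first two cumulants.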



In an e-family, typically, the sufficient statistics $T(X)$ are constrained
to live in the gradient space of $\mathcal{K}$. In other words,
$T(X) \in \partial \mathcal{K}(\Theta) / \partial \Theta$
or $T(X) \in \mathcal{K}'(\Theta)$ for short. In fact, a duality relates the
domain of the function $A(X)$ to the gradient space of $\mathcal{K}(\Theta)$
and one function can be computed from the other via the Laplace or
inverse-Laplace transform. Table 2.1 below lists example $A$ and
$\mathcal{K}$ functions for Gaussian, multinomial and other distributions.



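Distribution   A(X)                                   K(\Theta)                                                                                              Domain
------------   ------------------------------------   ----------------------------------------------------------------------------------------------------   ------------------------------------------
Gaussian       $-\frac{D}{2}\log(2\pi)$               $\frac{1}{2}\log\det\left((-2\Theta_2)^{-1}\right) - \frac{1}{4}\Theta_1^T \Theta_2^{-1} \Theta_1$   $\Theta_1 \in \mathbb{R}^D$, $\Theta_2 \in \mathbb{R}^{D \times D}$, $\Theta_2 \prec 0$
Multinomial    $\log(\Gamma(\eta + 1)/\nu)$           $\eta \log\left(1 + \sum_{d=1}^{D} \exp(\Theta_d)\right)$                                              $\Theta \in \mathbb{R}^D$
               $\eta = \sum_d X_d$
               $\nu = \prod_d \Gamma(X_d + 1)$
Exponential    $0$                                    $-\log(-\Theta)$                                                                                       $\Theta \in \mathbb{R}_-$
Gamma          $-\exp(X) - X$                         $\log \Gamma(\Theta)$                                                                                  $\Theta \in \mathbb{R}_+$
Poisson        $-\log(X!)$                            $\exp(\Theta)$                                                                                         $\Theta \in \mathbb{R}$

Table 2.1. Definitions of $A$ and $\mathcal{K}$ for some exponential family distributions.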



In addition, it is well known that maximum likelihood estimation
of the parameters of an e-family distribution with respect to an
independent identically distributed (iid) data set is fully tractable, unique
and straightforward. This is because the log-likelihood remains concave in
the parameters for the e-family and for products of e-family distributions.
It is also straightforward to integrate over an exponential family
distribution with so-called conjugate priors (see below) to obtain a fully
Bayesian estimate of its parameters. It is widely acknowledged that this
family enjoys tractable and straightforward estimation properties. For
instance, consider the likelihood of an iid data set
$\mathcal{X} = \{X_1, \ldots, X_T\}$:

$$P(\mathcal{X}|\Theta) = \prod_{t=1}^{T} P(X_t|\Theta).$$

If the generative model $P(X_t|\Theta)$ for each data point is in the
exponential family, the above aggregate likelihood for the whole data set
remains computationally tractable. In fact, products of the exponential
family are in the exponential family. In other words, the e-family forms a
closed set under multiplication (but not under addition, unfortunately). For
example, we would obtain the following likelihood under an e-family
distribution:



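$$P(\mathcal{X}|\Theta) = \prod_{t=1}^{T} \exp\left(A(X_t) + T(X_t)^T \Theta - \mathcal{K}(\Theta)\right)
  = \exp\left(\sum_t A(X_t) + \Big(\sum_t T(X_t)\Big)^T \Theta - T \, \mathcal{K}(\Theta)\right).$$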



This aggregate $P(\mathcal{X}|\Theta)$ is in the exponential family itself.
It is just as easy to integrate or maximize the likelihood
$P(\mathcal{X}|\Theta)$ as it is to work with a single data point
$P(X_t|\Theta)$. Also, the $\Theta$ model that maximizes likelihood is
straightforward to find. For example, to maximize the likelihood over the
parameter $\Theta$ we equivalently find the maximum of its logarithm
$\log P(\mathcal{X}|\Theta)$. This is done by taking its derivative with
respect to $\Theta$ and setting it to zero. This process yields the
following unique (by convexity of $\mathcal{K}(\Theta)$) closed-form
solution for the parameters:



$$\frac{\partial \mathcal{K}(\Theta)}{\partial \Theta} = \frac{1}{T} \sum_{t=1}^{T} T(X_t).$$
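The closed-form solution above amounts to matching the gradient of the log-partition function to the average sufficient statistic. A minimal Python sketch for the exponential distribution follows, assuming $A(X) = 0$, $T(X) = X$ and log-partition $-\log(-\Theta)$ as in the exponential row of the table of example distributions. The function names and the synthetic data are hypothetical and only serve the illustration.

```python
import math
import random

def fit_exponential_natural(samples):
    # Average sufficient statistic (1/T) sum_t T(X_t), with T(X) = X.
    mean_stat = sum(samples) / len(samples)
    # K(theta) = -log(-theta), so K'(theta) = -1/theta.
    # Solving K'(theta) = mean_stat gives theta = -1/mean_stat.
    return -1.0 / mean_stat

def avg_log_likelihood(theta, samples):
    # (1/T) sum_t [A(X_t) + T(X_t) * theta - K(theta)] with A = 0.
    mean_stat = sum(samples) / len(samples)
    return mean_stat * theta + math.log(-theta)

random.seed(0)
samples = [random.expovariate(2.0) for _ in range(5000)]  # true rate 2, so theta near -2
theta_ml = fit_exponential_natural(samples)
print(theta_ml)
```

Because the log-likelihood is concave in the natural parameter, perturbing `theta_ml` in either direction can only lower `avg_log_likelihood`.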



In fact, we can also consider the case of weighted data in the exponential
family which also admits an easy maximum log-likelihood solution.
Here, the log-likelihood of each datum $X_t$ is weighted by a non-negative
scalar amount $w_t$, generating the weighted log-likelihood objective
function $\sum_t w_t \log P(X_t|\Theta)$. Such an objective function may
reflect additional knowledge we have about the data or variations of iid
sampling assumptions. The maximum likelihood solution for the e-family
parameters in this setting is still well behaved:



$$\frac{\partial \mathcal{K}(\Theta)}{\partial \Theta} = \frac{1}{\sum_t w_t} \sum_{t=1}^{T} w_t \, T(X_t).$$
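The weighted solution can be sketched in the same way. The Python example below uses the Poisson case, assuming $T(X) = X$ and log-partition $\exp(\Theta)$, so that the stationarity condition reads $\exp(\Theta) = \sum_t w_t X_t / \sum_t w_t$. The helper name and the toy counts are hypothetical.

```python
import math

def weighted_poisson_ml(counts, weights):
    # Weighted average of the sufficient statistic T(X) = X.
    total = sum(weights)
    weighted_mean = sum(w * x for w, x in zip(weights, counts)) / total
    # K(theta) = exp(theta), so K'(theta) = exp(theta) = weighted_mean.
    return math.log(weighted_mean)

counts = [0, 2, 3, 7]
weights = [1.0, 2.0, 2.0, 0.5]
theta = weighted_poisson_ml(counts, weights)

# Integer weights behave like replicated data points.
theta_weighted = weighted_poisson_ml([1, 4], [1.0, 2.0])
theta_replicated = weighted_poisson_ml([1, 4, 4], [1.0, 1.0, 1.0])
print(theta, theta_weighted, theta_replicated)
```

The last two estimates coincide, which is one way to read the weights: a datum with weight 2 counts as if it had been observed twice.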
An interesting property of each exponential family distribution is that
it has a unique conjugate distribution that is its dual. For any
exponential family distribution over the sample space, the conjugate is the
natural choice for the prior distribution over parameters. The conjugate
of the conjugate of a distribution is the original distribution itself. In
particular, the Gaussian mean (with covariance held fixed) is self-dual.
For instance, recall the multivariate Gaussian distribution, a popular
choice for representing data $X$ that is in a $D$-dimensional continuous
vector space:

$$P(X|\mu,\Sigma) = \mathcal{N}(X|\mu,\Sigma) = \frac{1}{(2\pi)^{D/2}\sqrt{|\Sigma|}} \exp\left(-\frac{1}{2}(X-\mu)^T \Sigma^{-1} (X-\mu)\right).$$

The above Gaussian is written in the traditional form instead of natural
form (but it is easy to recognize that $\Theta_1 = \Sigma^{-1}\mu$ and
$\Theta_2 = -\frac{1}{2}\Sigma^{-1}$). The conjugate distribution for the
Gaussian is the product of a Gaussian over the mean $\mu$ and the
inverse-Wishart over the covariance $\Sigma$:



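$$P(\mu,\Sigma) = \mathcal{N}(\mu|m,\Sigma) \times \mathcal{IW}(\Sigma|V,k)
  = \mathcal{N}(\mu|m,\Sigma) \times \frac{|V|^{k/2} \, |\Sigma|^{-(k+D+1)/2} \exp\left(-\frac{1}{2}\mathrm{tr}(\Sigma^{-1}V)\right)}{2^{kD/2} \, \pi^{D(D-1)/4} \, \prod_{j=1}^{D} \Gamma\left(\frac{k+1-j}{2}\right)}.$$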






Similarly, recall the multinomial, which is often used to represent discrete
data (here $X$ is typically a discrete vector of binary entries that sum to
unity and the subscript in $X_k$ selects its $k$'th entry):

$$P(X|\alpha) = \prod_k \alpha_k^{X_k} \quad \text{where} \quad \sum_k \alpha_k = 1, \quad X_k \in [0,1], \quad \sum_k X_k = 1.$$

It has the following Dirichlet distribution for its conjugate: 



$$P(\alpha|\beta) = \frac{\Gamma\left(\sum_k \beta_k\right)}{\prod_k \Gamma(\beta_k)} \prod_k \alpha_k^{\beta_k - 1}.$$



These conjugate distributions make it straightforward to compute max- 
imum a posteriori estimates since their contribution is equivalent to ob- 
serving additional virtual data points in the maximum likelihood prob- 
lem. Furthermore, Bayesian integration and evidence are easily com- 
putable for any exponential family distribution if the prior that is chosen 
is the conjugate [20, 125]. 
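The "virtual data points" reading of a conjugate prior can be illustrated with the multinomial and its Dirichlet conjugate. In the Python sketch below, the MAP estimate is obtained by adding $\beta_k - 1$ pseudo-counts to the observed counts and then applying the ordinary maximum likelihood formula. The function names and the toy counts are hypothetical, and the sketch assumes every $\beta_k + \text{count}_k > 1$ so that the posterior mode lies inside the simplex.

```python
def multinomial_ml(counts):
    # Maximum likelihood: normalized counts.
    total = sum(counts)
    return [c / total for c in counts]

def multinomial_map(counts, beta):
    # Dirichlet(beta) prior acts like beta_k - 1 extra (virtual) observations
    # of outcome k; MAP is then ML on the augmented counts.
    virtual = [b - 1.0 for b in beta]
    return multinomial_ml([c + v for c, v in zip(counts, virtual)])

counts = [3, 0, 1]          # observed counts for three outcomes
beta = [2.0, 2.0, 2.0]      # one virtual observation per outcome
ml_estimate = multinomial_ml(counts)
map_estimate = multinomial_map(counts, beta)
print(ml_estimate)   # assigns zero probability to the unseen outcome
print(map_estimate)  # the prior keeps every outcome strictly positive
```

With these numbers the ML estimate is (3/4, 0, 1/4) while the MAP estimate is (4/7, 1/7, 2/7), showing how the prior smooths the estimate away from the boundary.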



2.4 Maximum Entropy 

The exponential family distributions can alternatively be derived from 
a maximum entropy approach as the class of distributions that maximize
entropy while satisfying moment constraints. Maximum entropy theory
has many connections to maximum likelihood and dates back to early 
work by Jaynes, Kullback and Shannon [82, 167, 105, 113]. The ba- 
sic method argues that a distribution should satisfy constraints on its 
moments while maximizing entropy to have the least commitment or 
information (also known as the principle of indifference or insufficient 
reason). In other words, maximum entropy finds a distribution P(X) 
whose entropy 



$$H(P) = -\sum_X P(X) \log P(X)$$

is as large as possible while satisfying certain moment or expectation
constraints. These constraints use pre-specified feature functions $f_i(X)$
that identify which moment constraints are important for the task at
hand and scalar constraint values $\alpha_i$ which identify what the
distribution should produce when computing the expected feature functions.
The constraints have the following general form:

$$\sum_X P(X) f_i(X) = \alpha_i \quad \forall i.$$

These feature functions or moments could be any function of the data
$X$. For instance, all polynomial powers of $X$ are valid choices, as in
$f_i(X) = X^i$, as are other nonlinear functions. These linear
constraints (which can also be inequalities) restrict the space of potential
probability distributions we can explore to a convex hull which we shall
denote by $\mathcal{P}$. Distributions that occupy $\mathcal{P}$ satisfy the
above constraints as well as normalize to unity, $\sum_X P(X) = 1$, as any
proper distribution should. We can imagine casting this constraint as yet
another moment constraint by using the feature function $f_0(X) = 1$ with
$\alpha_0 = 1$.

The solution for the maximum entropy distribution is interesting since 
it is unique and given in closed form. We derive the solution here for the 
discrete case (derivations for continuous X are similar) by analytically 
maximizing entropy subject to these constraints on P(X): 

$$P(X) = \arg\max_P H(P) = \arg\max_P -\sum_X P(X) \log P(X).$$

More generally, instead of maximizing entropy, we could equivalently 
minimize Kullback-Leibler divergence or relative entropy to the uniform 
(highest possible entropy) distribution or any other prior distribution 
$Q(X)$ that captures our intuitions (the uniform typically embodies some
notion of prior indifference):

$$P(X) = \arg\min_P KL(P\|Q) = \arg\min_P \sum_X P(X) \log \frac{P(X)}{Q(X)}.$$

Note that the maximum entropy approach is equivalent to the above cost
function when $Q(X)$ is uniform. We can represent the constraints on
$P(X)$ in the above minimization with Lagrange multipliers that ensure
that it must remain non-negative and integrate to unity in addition to
remaining in the hull $\mathcal{P}$. This gives us the following primal
Lagrangian optimization problem to minimize:

$$\mathcal{L} = KL(P\|Q) - \sum_i \lambda_i \left(\sum_X P(X) f_i(X) - \alpha_i\right) - \gamma \left(\sum_X P(X) - 1\right).$$

Taking derivatives with respect to each $P(X)$ and setting to zero we obtain:

$$1 + \log P(X) - \log Q(X) - \sum_i \lambda_i f_i(X) - \gamma = 0.$$

Isolating $P(X)$ yields:

$$P(X) = Q(X) \exp\left(\gamma - 1 + \sum_i \lambda_i f_i(X)\right).$$

The $\gamma$ variable is set by noting that the above distribution must be
normalized. This gives us the following equivalent solution:

$$P(X) = \frac{1}{Z(\lambda)} Q(X) \exp\left(\sum_i \lambda_i f_i(X)\right)
  \quad \text{where} \quad Z(\lambda) = \sum_X Q(X) \exp\left(\sum_i \lambda_i f_i(X)\right).$$

Note that the normalization term on the distribution is effectively a
function of the $\lambda_i$ Lagrange multipliers and we write it as
$Z(\lambda)$ to be more specific. The solution must also satisfy the
constraints corresponding to the $\lambda_i$ Lagrange multipliers, namely
$\sum_X P(X) f_i(X) = \alpha_i$. Inserting our solution into those
expressions yields:

$$\sum_X \frac{1}{Z(\lambda)} Q(X) \exp\left(\sum_j \lambda_j f_j(X)\right) f_i(X) = \alpha_i \quad \forall i.$$

Reinserting all the above formulae into our primal optimization problem
$\mathcal{L}$ gives:

$$\mathcal{L} = KL(P\|Q) - \sum_i \lambda_i \left(\sum_X P(X) f_i(X) - \alpha_i\right) - \gamma \left(\sum_X P(X) - 1\right)
  = -\log Z(\lambda) + \sum_i \lambda_i \alpha_i.$$

This compact expression involving only the Lagrange multipliers then
becomes our dual maximization problem $\mathcal{D}$ over the Lagrange
multipliers $\lambda$ as follows:

$$\mathcal{D} = -\log Z(\lambda) + \sum_i \lambda_i \alpha_i.$$

Since the negated log-partition function $-\log Z(\lambda)$ is concave in
$\lambda$, the above dual has a unique maximum. We find the globally optimal
$\lambda^*$ setting of the Lagrange multipliers by maximizing $\mathcal{D}$.
An interesting property emerges: the maximum entropy or minimum relative
entropy solution produces the exponential family form we saw earlier.
In the dual maximization function we are now solving, the $\lambda$ Lagrange
multipliers take on the role of the parameters $\Theta$ in the previous
maximum likelihood exponential family scenario. Note, here we are using
the term log-partition function loosely since $Z(\lambda)$ is basically the
same as $\mathcal{K}(\Theta)$ following a simple logarithm operation.
Clearly, taking derivatives of $\mathcal{D}$ and setting to zero would also
show the cumulant generating property of the partition function as outlined
in the previous section. Furthermore, the $\alpha_i$ take on the role of the
sufficient statistics in the maximum likelihood problem because the
empirical evaluation of the constraints will converge to the true
expectations:

$$\alpha_i = \sum_X P(X) f_i(X) \approx \frac{1}{T} \sum_{t=1}^{T} f_i(X_t) \quad \text{where} \quad X_t \sim P(X)$$

therefore producing an elegant duality between maximum entropy and 
maximum likelihood for the case of exponential family distributions. 
Both methods can be used to build a generative model of the data P(X) 
although in maximum likelihood the practitioner designs the paramet- 
ric form while in maximum entropy the practitioner designs moment 
constraints on the density. 
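To make the dual concrete, the following Python sketch solves a small maximum entropy problem by gradient ascent on the dual objective, $-\log Z(\lambda) + \lambda \alpha$. It assumes a six-sided die with a uniform prior $Q$, a single feature $f(X) = X$ and a target mean of 4.5; the example problem, the step size and the iteration count are illustrative assumptions rather than anything prescribed by the text.

```python
import math

def maxent_dual_fit(values, target_mean, steps=2000, rate=0.1):
    # Gradient ascent on D(lam) = -log Z(lam) + lam * alpha
    # for one feature f(X) = X and a uniform prior Q.
    lam = 0.0
    for _ in range(steps):
        weights = [math.exp(lam * x) for x in values]   # Q(X) exp(lam f(X)), up to a constant
        z = sum(weights)
        model_mean = sum(x * w for x, w in zip(values, weights)) / z
        lam += rate * (target_mean - model_mean)        # dD/dlam = alpha - E_P[f(X)]
    weights = [math.exp(lam * x) for x in values]
    z = sum(weights)
    return lam, [w / z for w in weights]

faces = [1, 2, 3, 4, 5, 6]
lam, probs = maxent_dual_fit(faces, target_mean=4.5)
fitted_mean = sum(x * p for x, p in zip(faces, probs))
print(lam, probs, fitted_mean)
```

At the optimum the gradient vanishes, so the fitted distribution reproduces the required moment, and it has the exponential family form with the multiplier playing the role of the natural parameter.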
More generally, the moment constraints could be inequalities as well.
In those cases, when we maximize the dual objective function, we must
also ensure that the $\lambda_i$ remain non-negative for linear inequality
constraints of the form $\sum_X P(X) f_i(X) \geq \alpha_i$ or that the
$\lambda_i$ remain non-positive for linear inequality constraints of the
form $\sum_X P(X) f_i(X) \leq \alpha_i$. These types of solutions can also
be seen from the perspective of information geometry as entropy projections
since we are minimizing distance from a prior $Q(X)$ while projecting onto
the convex hull of distributions $\mathcal{P}$ that satisfies all the
required moment constraints. Further interesting connections between maximum
likelihood and maximum entropy are elaborated in [96]. Other approaches to
learning or optimizing maximum entropy models including improved iterative
scaling (IIS) may be found in [39, 96].

2.5 Expectation Maximization and Mixtures 

Nevertheless, many interesting real-world distributions and generative 
models are not part of the exponential family and do not emerge from 
maximum entropy solutions. This is the case for sums or mixtures of 
exponential family distributions and includes many interesting models 
in the machine learning field such as mixtures of Gaussians, incomplete 
data models, hidden Markov models and latent Bayesian networks. In 
general, these types of models are characterized by having variables that 
are either missing, latent or hidden. The variables that we do not ob- 
serve must often be integrated over to compute marginals over what can 
be observed. While products of exponential family distributions seem 
to remain in the family, sums and mixtures do not and hence maximum 
likelihood estimation becomes somewhat more tedious. Consider a mix- 
ture model distribution with a hidden or latent discrete variable m that 
is unobserved and must be marginalized: 

$$P(X) = \sum_m P(m,X) = \sum_m P(m) P(X|m).$$

Assume that each conditional distribution $P(X|m)$ or each joint
distribution $P(m,X)$ is in the exponential family (for example we may have
a mixture of Gaussians with parameters $\mu_m$ and $\Sigma_m$ for
$m \in [1,M]$). We still have a problem that the convex combination or
summation of exponential families is not an exponential family
distribution. This is a mixture model and it is easy to see that the
log-likelihood $l(\Theta)$ of such models is no longer generally concave:

$$l(\Theta) = \sum_t \log P(X_t|\Theta) = \sum_t \log \sum_m P(m,X_t|\Theta).$$

Mathematically, the culprit is the logarithm of the summation which
makes maximization (or Bayesian integration) awkward. If we had access
to the complete data and for each $X_t$ also observed an $m_t$, the problem
would then no longer require summations and would once again be
within the exponential family.

The expectation maximization (EM) algorithm is frequently utilized 
to perform this maximization of likelihood for mixture models due
to its monotonic convergence properties and ease of implementation 
[13, 12, 40, 123, 110]. The EM algorithm finds its roots as the successor 
of old heuristic approaches to fitting mixtures of models and clustering 
algorithms such as k-means. In the case of mixtures or latent maximum 
likelihood estimation, notions of divide and conquer [96] can be used to 
motivate EM-type procedures. The expectation (E) step consists of com- 
puting a bound on the log likelihood using a straightforward application 
of Jensen’s inequality [93, 143]. The maximization (M) step is then the 
usual maximum likelihood step that would be used for a complete data 
model. However, EM’s strategy of divide and conquer works not because 
it is intuitive but because of the mathematical properties of maximum 
log-likelihood, namely the concavity of the log function and the direct 
applicability of Jensen’s inequality in forming a guaranteed lower bound 
on log-likelihood [40, 13, 12]. Figure 2.3 depicts what EM is effectively 
doing. In an E-step, we compute a lower bound on the log-likelihood
using Jensen's inequality to form an auxiliary function
$\mathcal{Q}(\Theta|\tilde{\Theta}) \leq l(\Theta)$.
This auxiliary function makes tangential contact (the value and, more
importantly, the derivatives of the two functions are the same) at the
current configuration of the model $\tilde{\Theta}$. In other words,
$l(\tilde{\Theta}) = \mathcal{Q}(\tilde{\Theta}|\tilde{\Theta})$. In
the M-step we maximize that lower bound by increasing the typically
concave auxiliary function $\mathcal{Q}(\Theta|\tilde{\Theta})$. This can be
done for instance by computing derivatives of it over $\Theta$ and setting
to zero. By iterating these two procedures, we are bounding and maximizing
the objective function which is therefore guaranteed to increase
monotonically.

EM thus needs a guaranteed bound on the log-likelihood. Recall the
definition of Jensen's inequality: $f(E\{\cdot\}) \geq E\{f(\cdot)\}$ for
concave $f$. We apply Jensen to the log-sum as follows for all
$t = 1, \ldots, T$ observations:



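$$l(\Theta) = \sum_t \log \sum_m P(m,X_t|\Theta)
  = \sum_t \log \sum_m Q(m|t) \frac{P(m,X_t|\Theta)}{Q(m|t)}
  \geq \sum_t \sum_m Q(m|t) \log \frac{P(m,X_t|\Theta)}{Q(m|t)}
  = \mathcal{Q}(\Theta|\tilde{\Theta}).$$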



Figure 2.3. Expectation-Maximization as Iterated Bound Maximization. The solid
line is the log-likelihood while the dashed lines are the auxiliary function
$\mathcal{Q}(\Theta|\tilde{\Theta})$ at various settings of the bound. In the
above example, we iterate different parameter settings starting from
$\Theta_0$ until $\Theta_3$.

Here, the bound holds for each $Q(m|t)$ which can be any discrete
distribution over all possible configurations $m$ that is non-negative and
sums to unity over $m$. However, for tangential contact in the bound
maximization scheme, we need
$l(\tilde{\Theta}) = \mathcal{Q}(\tilde{\Theta}|\tilde{\Theta})$, which
enforces the following:

$$Q(m|t) = \frac{P(m,X_t|\tilde{\Theta})}{\sum_m P(m,X_t|\tilde{\Theta})}.$$



Above we have set the $Q$ distribution to the posterior distribution of the
latent variables or $Q(m|t) = P(m|X_t,\tilde{\Theta})$. This posterior is
computed given the current probability model $P(m,X_t|\tilde{\Theta})$ at
the configuration $\tilde{\Theta}$. It is traditional to call the terms
$Q(m|t)$ the responsibilities or to consider all $Q(m|t)$ for
$t = 1, \ldots, T$ as a distribution over the hidden variables, also called
a variational distribution. It is then straightforward to maximize the lower
bound on log-likelihood. For instance, if the complete data distribution
$P(m,X_t|\Theta)$ is in the exponential family, we simply ensure
that the following gradient equals zero:



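$$\frac{\partial \mathcal{Q}(\Theta|\tilde{\Theta})}{\partial \Theta} = \sum_t \sum_m Q(m|t) \frac{\partial}{\partial \Theta} \left(A(m,X_t) + T(m,X_t)^T \Theta - \mathcal{K}(\Theta)\right).$$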

This effectively becomes a weighted maximum likelihood problem as
discussed in the previous exponential family section. Here, the weights
$w_t$ range over $m$ as well as $t$ and thus $w_{tm} = Q(m|t)$. Therefore,
the maximum likelihood problem for the intractable mixture model now only
involves iterating weighted maximum likelihood estimates on exponential
family distributions. Iterating maximization and lower bound computation
at the new $\Theta$ converges to a local maximum of log-likelihood. The
update rule is given by the following E-step and M-step routines
respectively:

$$Q(m|t) = P(m|X_t,\Theta) \quad \forall t$$

$$\Theta = \arg\max_{\Theta} \sum_t \sum_m Q(m|t) \log \frac{P(m,X_t|\Theta)}{Q(m|t)}.$$
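The E-step and M-step above can be sketched for a one-dimensional mixture of two Gaussians. The Python code below is a minimal illustration, not a production implementation: the synthetic data, the initial parameters, the number of iterations and the small variance floor are all assumptions made for this example. The M-step is exactly a weighted maximum likelihood estimate with weights $w_{tm} = Q(m|t)$.

```python
import math
import random

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(data, priors, means, variances):
    # l(theta) = sum_t log sum_m P(m) P(X_t | m)
    return sum(math.log(sum(p * gaussian_pdf(x, m, v)
                            for p, m, v in zip(priors, means, variances)))
               for x in data)

def em_step(data, priors, means, variances):
    # E-step: responsibilities Q(m|t) = P(m | X_t, current parameters).
    resp = []
    for x in data:
        joint = [p * gaussian_pdf(x, m, v) for p, m, v in zip(priors, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: weighted maximum likelihood with weights w_tm = Q(m|t).
    new_priors, new_means, new_vars = [], [], []
    for k in range(len(priors)):
        weight = sum(r[k] for r in resp)
        mean = sum(r[k] * x for r, x in zip(resp, data)) / weight
        var = sum(r[k] * (x - mean) ** 2 for r, x in zip(resp, data)) / weight
        new_priors.append(weight / len(data))
        new_means.append(mean)
        new_vars.append(max(var, 1e-6))  # floor guards against a collapsing component
    return new_priors, new_means, new_vars

random.seed(1)
data = ([random.gauss(-2.0, 1.0) for _ in range(200)]
        + [random.gauss(3.0, 1.0) for _ in range(200)])
params = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])  # priors, means, variances
history = [log_likelihood(data, *params)]
for _ in range(30):
    params = em_step(data, *params)
    history.append(log_likelihood(data, *params))
print(history[0], history[-1], params[1])
```

The recorded `history` of log-likelihood values should never decrease from one iteration to the next, which is the monotonic convergence property discussed above.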



There are two important generalizations of the EM algorithm which are
based on avoiding the best possible updates to either $Q$ or $\Theta$ [134].
More generally, take $P(\mathcal{X})$ to be the distribution over all data
$\{X_1, \ldots, X_T\}$ (ignoring for the moment that it is iid) and also
$Q(\mathcal{M})$ to be a distribution over all hidden variables for the
whole dataset (i.e. over all $t = 1, \ldots, T$ data points and hidden
variables). We can write the log-likelihood and apply Jensen more generally
using any variational distribution $Q(\mathcal{M})$:

$$\log P(\mathcal{X}|\Theta) \geq E_{Q(\mathcal{M})}\{\log P(\mathcal{X},\mathcal{M}|\Theta)\} + H(Q(\mathcal{M})).$$

Here, we are no longer requiring that $P(\mathcal{X}) = \prod_t P(X_t)$ or
$Q(\mathcal{M}) = \prod_t Q(m|t)$. For any general distribution that
involves hidden variables we can lower bound the incomplete log-likelihood
(involving summing out over the hidden variables) by the expectation of the
complete log-likelihood using any variational distribution (and its entropy)
as above. We now consider more flexible updates for both the $Q$
distribution and for the $\Theta$ parameters which still converge, yet more
slowly. This is done by realizing that the right hand side of the inequality
equals the negated Kullback-Leibler divergence between $Q(\mathcal{M})$ and
the posterior $P(\mathcal{M}|\mathcal{X},\Theta)$, up to the additive term
$\log P(\mathcal{X}|\Theta)$ which does not depend on $Q$. It is then
natural to maximize this bound on the log-likelihood by incrementally
minimizing the KL-divergence as we alternate between updates of
$Q(\mathcal{M})$ and $\Theta$ [134]. The optimal update for $Q(\mathcal{M})$
is still to set it to the posterior
$Q(\mathcal{M}) = P(\mathcal{M}|\mathcal{X},\Theta)$. Meanwhile
the optimal update for the parameters is a $\Theta$ parameter that maximizes
the expected complete log-likelihood under the fixed $Q(\mathcal{M})$, namely
$E_{Q(\mathcal{M})}\{\log P(\mathcal{X},\mathcal{M}|\Theta)\}$. However,
instead of performing a complete M-step, it is also reasonable to select a
$\Theta$ that slightly increases (yet does not maximize) the bound subject
to a fixed $Q$. Furthermore, instead of performing a complete E-step, we may
select a $Q$ which slightly increases the bound subject to a fixed $\Theta$.
These partial steps are still guaranteed to converge.
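The relation between the bound and the KL-divergence can be spelled out in one short derivation. It uses only the standard factorization of the complete likelihood into posterior times marginal, $P(\mathcal{X},\mathcal{M}|\Theta) = P(\mathcal{M}|\mathcal{X},\Theta) P(\mathcal{X}|\Theta)$, and is offered here as a reading aid rather than as a quotation of the original derivation:

```latex
\begin{align*}
E_{Q(\mathcal{M})}\{\log P(\mathcal{X},\mathcal{M}|\Theta)\} + H(Q(\mathcal{M}))
  &= \sum_{\mathcal{M}} Q(\mathcal{M})
     \log \frac{P(\mathcal{M}|\mathcal{X},\Theta)\, P(\mathcal{X}|\Theta)}{Q(\mathcal{M})} \\
  &= \log P(\mathcal{X}|\Theta)
     - KL\big(Q(\mathcal{M}) \,\|\, P(\mathcal{M}|\mathcal{X},\Theta)\big).
\end{align*}
```

Since the KL-divergence is non-negative and vanishes only when $Q(\mathcal{M})$ equals the posterior, the bound is tight exactly for the full E-step, and any partial update of $Q$ that raises the bound with $\Theta$ fixed necessarily lowers the divergence to the posterior.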

Another useful property of the above partial stepping approach is
that we no longer need to update $Q(\mathcal{M})$ with the current posterior
but may instead use another approximate variational distribution within a
restricted class. This may be the case when the posterior distribution
$P(\mathcal{M}|\mathcal{X},\Theta)$ has an intractable number of
configurations and computing or storing it exactly is intractable. One
approach, then, is to force $Q$ to be completely factorized across each of
its latent variables which dramatically simplifies its storage and
subsequent manipulations. Such methods are called
mean-field methods [75, 81, 161, 134, 58]. We may also only want to force
partial factorization to avoid intractable posteriors while still permitting
$Q$ to have some dependencies between the latent variables. This class
of simplifications is called structured mean-field methods. Both methods
minimize KL divergence and improve the EM-like bound iteratively.
The factorization on $Q$ can prevent it from making tangential contact
with the log-likelihood yet the estimation method will still converge and
produce useful settings of the parameters $\Theta$. Thus, we can find and
work with $Q$ distributions that approximate the posterior by minimizing the
KL-divergence to it and still obtain a convergent EM-like algorithm that
maximizes likelihood iteratively.

It is also possible to employ the above EM derivations and their par- 
tial or incremental counterparts (including mean-field and structured 
mean-field) to iteratively approximate the integrals for Bayesian infer- 
ence. When distributions are no longer in the exponential family, the 
integrals required in full Bayesian inference become intractable (as does 
maximum likelihood). One solution is to compute bounds on the desired 
terms, as in the EM approach above. This makes the Bayesian inference 
problem tractable for mixtures and hidden or latent-variable models. 
This general approach is known as variational Bayesian inference [3, 57, 79]
and provides a more precise approximation to Bayesian inference
than maximum likelihood while keeping computations tractable for la- 
tent models. 

2.6 Graphical Models

We can upgrade beyond simple mixture models to a more general and 
powerful class of distributions and generative models called graphical 
models. Graphical models permit us to manipulate complicated multi- 
variate distributions by using a graph whose nodes represent the ran- 
dom variables in our system and whose edges represent the dependencies 
between the random variables themselves. This combination of graph 
theory and statistics has endowed graphical models and their cousins, 
Bayesian networks, with a deep and rich formalism which is treated in
depth in several texts [96, 142, 111]. This section provides a brief
yet useful overview of the material. 

In graphical models, some nodes correspond to observed complete
data and are denoted by the variable $X$, some correspond to latent
missing variables as was the case of $m$ in the mixture model scenario
and some nodes correspond to parameters such as $\Theta$. The graph
captures the interaction between all variables, missing data and parameters
(as well as hyper-parameters, i.e. parameters describing distributions
over parameters). Furthermore, multiple observations need not be iid
as they were so far in the maximum likelihood problems but could have
more complex conditional dependencies which are easily represented by
the graph. Recall the directed graphical model or Bayesian network
in Figure 2.2. Directed arrow heads indicate a parent-child relationship
between nodes as we travel from the arrow base to the arrow tip. Such
directed arrows typically indicate a causal relationship between parent
and child nodes yet this is not necessarily true in the general case. The
distribution has the following factorization into conditional distributions
over each node $X_i$ given its parent nodes $X_{\pi_i}$:

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i|X_{\pi_i}).$$

We can also consider undirected graphical models where nodes are linked 
with lines instead of arrows that just indicate a dependency between two 
variables without any specific directionality. In undirected graphs, the 
distribution factorizes according to a product of non-negative potential 
functions (as opposed to conditional distributions) over the maximal 
cliques C (largest sets of fully interconnected nodes) of the undirected 
graphs: 

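$$P(X_1, \ldots, X_n) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi(X_C)
  \quad \text{where} \quad Z = \sum_{X_1, \ldots, X_n} \prod_{C \in \mathcal{C}} \psi(X_C).$$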

In the above, $\mathcal{C}$ is the set of all such maximal cliques (each is
denoted by $C$) and $X_C$ is the set of all random variables in the clique.
Also, $Z$ is a normalizing constant that ensures that the above distribution
integrates to unity. The functions $\psi(X_C)$ are called potential
functions since they need not be valid conditional or marginal
distributions. One can convert a directed graph into an undirected graph
through a process called moralization where common parents of a node get
linked with undirected edges. We then drop all the arrow heads in the
directed graph leaving behind undirected links in their place. This removes
some of the independency structures in the directed graph yet only slightly.
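Moralization as just described is simple enough to sketch directly. In the Python example below, a directed graph is given as a dictionary mapping each node to its list of parents; this representation, the function name and the four-node example graph are assumptions made for illustration.

```python
from itertools import combinations

def moralize(parents):
    """Return the undirected edge set of the moral graph.

    `parents` maps each node to the list of its parent nodes.
    """
    edges = set()
    for child, pars in parents.items():
        for p in pars:
            edges.add(frozenset((p, child)))   # drop the arrow head
        for a, b in combinations(pars, 2):
            edges.add(frozenset((a, b)))       # link common parents of the child
    return edges

# A and B are both parents of C, and C is the parent of D.
parents = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}
moral_edges = moralize(parents)
print(sorted(tuple(sorted(e)) for e in moral_edges))
```

In this example the new A-B link hides the fact that A and B were marginally independent in the directed graph, which is the kind of independency structure that moralization gives up.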

The advantages of graphical representations go beyond just visualization
and ease of design. The graphical models also constrain the generative
probability models, significantly reducing their degrees of freedom
and permitting their efficient storage. For instance, if we are dealing
only with binary random variables, the joint distribution
$P(X_1, \ldots, X_n)$ would be of size $2^n$ in the general case. If we are
dealing with $n = 100$ random variables, this probability density would be
impossible to store. However, assume that we know the probability obeys a
graphical model such that the 100 variables only appear as 80 cliques of 3
variables each (in other words assume the cardinality $|X_C| = 3$ for all
cliques). We can thus store the pdf as a factorized undirected graph with
$80 \times 2^3$ scalar entries, a much smaller number than $2^{100}$. Such
graphical models also lead to more efficient algorithms for manipulating the
probability density functions as we will outline below.
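The storage comparison in this example is easy to verify with a few lines of Python; the variable names are illustrative only.

```python
n_variables = 100                 # binary random variables
n_cliques, clique_size = 80, 3    # cliques in the assumed graphical model

full_table = 2 ** n_variables               # entries in the unconstrained joint table
factored = n_cliques * 2 ** clique_size     # entries across all clique potentials
print(full_table, factored, full_table // factored)
```

The factorized representation needs 640 entries, while the full table needs roughly $1.27 \times 10^{30}$.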

The main manipulations we need to perform with such distributions
are marginalizing, conditioning, taking expectations, computing maximum
likelihood and computing Bayesian integrals. Even computing
marginals from a large joint distribution above may be awkward since
we need to sum over all configurations of the variables. For instance,
if we were to compute the marginal distribution $P(X_j, X_k)$ from the
above directed or undirected model, we would sum $P(X_1, \ldots, X_n)$ over
all the configurations of the other variables. Consider the case where
all $X_1, \ldots, X_n$ are binary. Then, the distribution is of size $2^n$
and we need to perform a total of $2^{n-2}$ summations to compute a single
entry of the marginal $P(X_j, X_k)$. Clearly, for a large number $n$ of
variables, this becomes intractable.

Such an intractable marginalization might arise while we are apply- 
ing the EM algorithm to estimate a distribution with latent variables. 
In particular, the community often deals with latent graphical models 
or latent Bayesian networks where the graphical model contains nodes 
that haven’t been observed. For instance, the variable m was a latent 
variable in the simple mixture model case. Many standard operations 
quickly seem intractable if the distribution has many interdependent la- 
tent variables. This problem persists even after EM is invoked since we 
may have to consider a large number of configurations of latent vari- 
ables in our graphical model. For instance, consider working with the 
likelihood of a probability model $P(\mathcal{M},\mathcal{X}|\Theta)$ which
has non-trivial dependencies between its many variables and does not
factorize as our previous iid (independent identically distributed) examples
did. Similarly, the variational distribution $Q(\mathcal{M})$ which is set
to the posterior $Q(\mathcal{M}) = P(\mathcal{M}|\mathcal{X},\Theta)$ may
also involve a large number of interdependent terms. Computing this
posterior during the E-step may quickly become intractable if there is a
large number of variables (hidden or visible) since we need to convert the
joint distribution into a normalized distribution over the posterior by
marginalizing. Recovering the marginal $Q(\mathcal{M})$ and its
sub-marginals during EM iterations could therefore become intractable.

For instance, consider a hidden Markov model (HMM) which is a type
of latent graphical model. This probability model is over temporal
sequences or strings and is used in applications such as speech recognition
and protein/gene sequence modeling [150, 7]. Assume we have observed
a string of data that consists of $T$ observations
$\mathcal{X} = \{X_1, \ldots, X_T\}$. An HMM has the probability
distribution $P(\mathcal{M},\mathcal{X}|\Theta)$ where here we take
$\mathcal{M}$ as the missing data. The missing data contains discrete hidden
Markov states $\{m_1, \ldots, m_T\}$. These hidden states evolve according
to a stationary Markov transition matrix which is represented by a fixed
table $P(m_t|m_{t-1},\Theta)$. The probability distribution for the hidden
Markov model is then:

$$P(\mathcal{M},\mathcal{X}) = P(m_1) P(X_1|m_1) \prod_{t=2}^{T} P(m_t|m_{t-1}) P(X_t|m_t).$$

Here, for brevity, we are not showing the conditional dependence on $\Theta$
for all the probability densities above (joint, marginal and conditional
densities). More formally, all the above should be written as
$P(\cdot|\Theta)$ instead of $P(\cdot)$. Since the $\mathcal{M}$ variables
are not observed, we would typically employ an EM algorithm for estimating
the model parameters $\Theta$. Given observations
$\mathcal{X} = \{X_1, \ldots, X_T\}$, we need to compute the
posterior over the hidden variables $\mathcal{M}$ as follows:

$$Q(m_1, \ldots, m_T) = Q(\mathcal{M}) = P(\mathcal{M}|\mathcal{X}) \propto P(\mathcal{M},\mathcal{X}).$$
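For a chain-structured model such as the HMM, single-state marginals $Q(m_t)$ of this posterior can be obtained without enumerating every configuration of the hidden states. The Python sketch below compares brute-force enumeration with the forward-backward recursion, which is the chain special case of message passing and is used here as a stand-in for the general junction tree machinery. The toy prior, transition and emission tables and the observation sequence are hypothetical, and no rescaling is applied, so the sketch is only suitable for short sequences.

```python
import itertools

def brute_force_marginals(prior, trans, emit, obs):
    # Sum the joint P(M, X) over all M^T hidden state sequences.
    n_states, length = len(prior), len(obs)
    marg = [[0.0] * n_states for _ in range(length)]
    for seq in itertools.product(range(n_states), repeat=length):
        p = prior[seq[0]] * emit[seq[0]][obs[0]]
        for t in range(1, length):
            p *= trans[seq[t - 1]][seq[t]] * emit[seq[t]][obs[t]]
        for t, state in enumerate(seq):
            marg[t][state] += p
    evidence = sum(marg[0])  # P(X), used to normalize into a posterior
    return [[v / evidence for v in row] for row in marg]

def forward_backward_marginals(prior, trans, emit, obs):
    # Same marginals with on the order of T * M^2 operations.
    n_states, length = len(prior), len(obs)
    fwd = [[prior[s] * emit[s][obs[0]] for s in range(n_states)]]
    for t in range(1, length):
        fwd.append([emit[s][obs[t]] * sum(fwd[-1][r] * trans[r][s] for r in range(n_states))
                    for s in range(n_states)])
    bwd = [[1.0] * n_states for _ in range(length)]
    for t in range(length - 2, -1, -1):
        bwd[t] = [sum(trans[s][r] * emit[r][obs[t + 1]] * bwd[t + 1][r]
                      for r in range(n_states))
                  for s in range(n_states)]
    evidence = sum(fwd[-1])
    return [[fwd[t][s] * bwd[t][s] / evidence for s in range(n_states)]
            for t in range(length)]

prior = [0.6, 0.4]                  # P(m_1)
trans = [[0.7, 0.3], [0.2, 0.8]]    # trans[r][s] = P(m_t = s | m_{t-1} = r)
emit = [[0.9, 0.1], [0.3, 0.7]]     # emit[s][x] = P(X_t = x | m_t = s)
obs = [0, 1, 1, 0, 1]
slow = brute_force_marginals(prior, trans, emit, obs)
fast = forward_backward_marginals(prior, trans, emit, obs)
print(fast)
```

Both routines return the same table of posteriors, but the enumeration grows exponentially with the sequence length while the recursion grows only linearly.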

In fact, during our EM learning algorithm, we may be interested in only
the sub-marginal probability over a single hidden state $Q(m_t)$. This is
the case when we need to compute or maximize the expected complete
log-likelihood. Clearly, computing this marginal in a brute force way
involves summing over many configurations. If the variables $m_t$ are all
$M$-ary (in other words, have a cardinality $|m_t| = M$), this would involve
on the order of $M^T$ operations. However, a more efficient route is
possible. This is done by converting the hidden Markov model into a junction
tree and employing the so-called junction tree algorithm (JTA). We only
briefly outline the JTA here; see [96] for more details. We will use JTA
to efficiently compute marginals and sub-marginals of large probability
distributions such as the posterior $Q(m_1, \ldots, m_T)$. Basically, the
junction tree algorithm permits us to compute marginals and other aspects of
the distribution ensuring that these all obey consistency requirements on 
the fully joint probability density without having to manipulate and sum 
over the intractably large pdf directly. Figure 2.4 depicts the steps in 
converting a hidden Markov model’s directed graph representation into 
a junction tree. We first perform moralization by joining common parents.
Then, we perform triangulation of the resulting graph. Triangulation
allows us to draw more undirected links creating additional possible
dependencies between the variables. Adding links does not change the
resulting marginalization problem or solution; it only makes it require
more work to perform. Essentially, brute force marginalization of order
$M^T$ operations is equivalent to connecting all nodes. We triangulate the
graph by sparingly drawing additional undirected links such that there
are no chordless cycles of 4 or more nodes in the graph. For both
moralization and triangulation, the hidden Markov model does not really
change at all, since no node has more than one parent and the undirected
graph contains no cycles. We then find all maximal cliques
$C \in \mathcal{C}$ in the graph and place potential functions $\psi(X_C)$
over their nodes. These cliques now need to be connected into a tree.
This is done by finding the number of overlapping nodes between all
pairs of distinct cliques. These overlapping sets of nodes are called
separators. We grow a junction tree via Kruskal’s algorithm [194] by
greedily linking cliques that have the largest separators while
preserving the definition of a tree by skipping links that create loops.
If the largest separator requires us to add a link that creates a loop,
we skip it and go to the next largest cardinality separator candidate.
Once all cliques are connected, we have a tree of interconnected cliques
called a junction tree as in Figure 2.4(c). Cliques are shown as
oval-shaped nodes over the random variables that form the clique. We can
also consider the separators between all connected cliques and represent
these with square-shaped nodes. These separator nodes $X_S$ for
$S \in \mathcal{S}$ also have their own potential functions labeled
$\phi(X_S)$ or $\zeta(X_S)$. We sometimes write these potentials as
$\phi_S$ to represent $\phi(X_S)$ in short form.

Figure 2.4. Converting a Hidden Markov Model into a Junction Tree. Recovered panel titles: (a) HMM Directed Graphical Model; (b) HMM Undirected (Triangulated) Graph.
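The greedy linking step described above can be sketched as a maximum-weight spanning tree over the cliques, where the weight of a candidate link is the size of its separator. This is a minimal illustration under our own assumptions: the function name `build_junction_tree`, the union-find bookkeeping and the variable names `m1`, `X1`, and so on are hypothetical, and the cliques are those of a three-step hidden Markov model.

```python
def build_junction_tree(cliques):
    """Link cliques greedily by largest separator, skipping links that form loops."""
    # Candidate links between all pairs of distinct cliques, weighted by overlap size.
    candidates = []
    for i in range(len(cliques)):
        for j in range(i + 1, len(cliques)):
            separator = cliques[i] & cliques[j]
            if separator:
                candidates.append((len(separator), i, j, separator))
    candidates.sort(key=lambda c: -c[0])  # largest separators first

    parent = list(range(len(cliques)))  # union-find forest for loop detection

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    tree = []
    for _, i, j, separator in candidates:
        ri, rj = find(i), find(j)
        if ri != rj:            # linking i and j does not create a loop
            parent[ri] = rj
            tree.append((i, j, separator))
    return tree

# Maximal cliques of a T = 3 hidden Markov model (variable names are illustrative).
cliques = [frozenset(c) for c in (
    {"m1", "X1"}, {"m1", "m2"}, {"m2", "X2"}, {"m2", "m3"}, {"m3", "X3"})]
tree = build_junction_tree(cliques)
```

Each entry of `tree` is a pair of clique indices together with the separator they share; for the HMM every separator is a single hidden state.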
The junction tree algorithm is then initialized by pasting into the
clique functions their respective conditional probabilities from the
original directed graph’s marginalization problem we are trying to solve.
These are the conditional probabilities we can easily read off from the
directed model that have the same domain as the potential functions in
terms of random variables they have in common. For the hidden Markov
model, the conditional probabilities we place into the cliques are:

$$\psi(m_t, X_t) \leftarrow P(X_t|m_t) \qquad \psi(m_t, m_{t-1}) \leftarrow P(m_t|m_{t-1}).$$

The separator functions are all initialized to unity in each of their cells:

$$\phi(m_t) \leftarrow 1 \qquad \zeta(m_t) \leftarrow 1.$$

Note that in the HMM we have observed $X$ and, therefore, the potential
functions $\psi(m_t, X_t)$ are effectively only over the variable $m_t$
since $X_t$ is already given by its observed value.

We then need to perform belief propagation by having adjacent clique
nodes transmit their current estimate of the marginal over the separator
nodes they share. This is done by updating a given clique $\psi_W$ with
information from one of its neighboring cliques $\psi_V$. The information
that is communicated goes through their adjacent separator $\phi_S$. The
clique $\psi_V$ communicates its estimate of what the $\phi_S$ values should
be by summing out over all its variables except those in $\phi_S$, in other
words $\sum_{V \setminus S}$. Then, we update the separator and the
neighboring clique $\psi_W$ as follows:

$$\phi_S^* = \sum_{V \setminus S} \psi_V \qquad \psi_W^* = \frac{\phi_S^*}{\phi_S} \psi_W \qquad \psi_V^* = \psi_V.$$
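As an illustration of this update, the sketch below runs one collect step and one distribute step on the smallest possible junction tree: two cliques $V = (a, b)$ and $W = (b, c)$ joined by the separator $S = (b)$. The binary variables, the probability tables and the helper name `absorb` are all hypothetical choices made for the example.

```python
from itertools import product

# Two cliques V = (a, b) and W = (b, c) sharing the separator S = (b); all binary.
# The tables are hypothetical: psi_V holds P(a) P(b|a) and psi_W holds P(c|b).
p_a = [0.3, 0.7]
p_b_given_a = [[0.9, 0.1], [0.4, 0.6]]
p_c_given_b = [[0.8, 0.2], [0.5, 0.5]]

psi_V = {(a, b): p_a[a] * p_b_given_a[a][b] for a, b in product(range(2), repeat=2)}
psi_W = {(b, c): p_c_given_b[b][c] for b, c in product(range(2), repeat=2)}
phi_S = {b: 1.0 for b in range(2)}  # separator initialized to unity

def absorb(psi_from, psi_to, phi, sep_index_from, sep_index_to):
    """One update: phi* = sum of psi_from over non-separator variables,
    then psi_to* = (phi* / phi) psi_to."""
    phi_new = {s: 0.0 for s in phi}
    for assignment, value in psi_from.items():
        phi_new[assignment[sep_index_from]] += value
    psi_new = {}
    for assignment, value in psi_to.items():
        s = assignment[sep_index_to]
        ratio = phi_new[s] / phi[s] if phi[s] > 0 else 0.0
        psi_new[assignment] = value * ratio
    return psi_new, phi_new

# Collect towards W, then distribute back to V.
psi_W, phi_S = absorb(psi_V, psi_W, phi_S, sep_index_from=1, sep_index_to=0)
psi_V, phi_S = absorb(psi_W, psi_V, phi_S, sep_index_from=0, sep_index_to=1)
```

After the two sweeps `psi_V` holds the marginal over $(a, b)$, `psi_W` the marginal over $(b, c)$ and `phi_S` the marginal over $b$. Since the tables were normalized to begin with, each of them sums to one.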

This update is done in two separate sweeps on the junction tree. First,
messages are passed by a collect operation. Messages and clique/separator
updates are passed towards a designated root node, updating cliques and
separators on their way. For example, select $\psi(m_3, m_4)$ as the root and
pull messages towards it. Second, we perform a distribute operation
which pushes messages away from the root to all leaves, updating each
separator and clique a second time to obtain $\psi^{**}$ and $\phi^{**}$. The
end result is that all the clique and separator potential functions will
then settle to the value of the marginals over the variables in their
arguments:

$$\psi_W^{**} \propto P(X_W) \qquad \phi_S^{**} \propto P(X_S).$$

Furthermore, we note that the sum total of each potential or separator
function $\sum_{X_W} \psi_W^{**}$ equals the normalizer $Z$ of the undirected
model. If the undirected graph was properly normalized with $Z$ a priori,
the total of each clique settles to unity. For the hidden Markov model
clique potentials we would obtain:

$$\psi(m_t, X_t) \propto P(m_t|X) \qquad \psi(m_t, m_{t-1}) \propto P(m_t, m_{t-1}|X).$$

Similarly, the hidden Markov model separator potential functions will
settle to marginals over their variables as well:

$$\phi(m_t) \propto P(m_t|X) \qquad \zeta(m_t) \propto P(m_t|X).$$

We can then immediately obtain various sub-marginals of interest in the
hidden Markov model for inference or performing an M-step in the EM
algorithm. In fact, all that is needed for the E-step in the HMM are the
marginals $P(m_t|X)$ and pairwise marginals $P(m_t, m_{t-1}|X)$ which the
junction tree algorithm produces just as efficiently for us as the
traditional Baum-Welch algorithm for HMMs [150, 13, 12]. However, JTA is a
more general algorithm that applies to general graphical models. In
addition, JTA can also be used to efficiently compute the max (or min) of
a pdf in this way by replacing the summations $\sum_{V \setminus S}$ in the
update rule with maximizations $\max_{V \setminus S}$ (or minimizations
$\min_{V \setminus S}$). Then, the potential functions will settle to:

$$\psi_W^{**} \propto \max_{X \setminus W} P(X) \qquad \phi_S^{**} \propto \max_{X \setminus S} P(X).$$

Other necessary learning operations can also be done efficiently by
exploiting the graphical model’s structure, including estimating the
incomplete likelihood or maximizing the expected complete likelihood. This
allows the user to develop efficient generative learning algorithms
for hidden Markov models as well as more general latent graphical
models. For additional details the reader should consult [96, 142, 111].
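For the hidden Markov chain, the single-state and pairwise posterior marginals needed in the E-step can be sketched with the classical forward-backward recursions, used here in place of an explicit junction tree implementation. The toy parameters and function names are hypothetical; the brute-force routine is included only to check the efficient one on a short sequence.

```python
from itertools import product

def forward_backward(obs, prior, trans, emit):
    """Posterior marginals P(m_t|X) and pairwise P(m_{t-1}, m_t|X) in O(T M^2)."""
    T, M = len(obs), len(prior)
    alpha = [[0.0] * M for _ in range(T)]  # alpha[t][i] = P(X_1..X_t, m_t = i)
    beta = [[1.0] * M for _ in range(T)]   # beta[t][i]  = P(X_{t+1}..X_T | m_t = i)
    for i in range(M):
        alpha[0][i] = prior[i] * emit[i][obs[0]]
    for t in range(1, T):                  # forward sweep towards the last state
        for j in range(M):
            alpha[t][j] = emit[j][obs[t]] * sum(
                alpha[t - 1][i] * trans[i][j] for i in range(M))
    for t in range(T - 2, -1, -1):         # backward sweep to the first state
        for i in range(M):
            beta[t][i] = sum(trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(M))
    evidence = sum(alpha[T - 1])           # P(X)
    single = [[alpha[t][i] * beta[t][i] / evidence for i in range(M)]
              for t in range(T)]
    pairwise = [[[alpha[t - 1][i] * trans[i][j] * emit[j][obs[t]] * beta[t][j] / evidence
                  for j in range(M)] for i in range(M)] for t in range(1, T)]
    return single, pairwise

def brute_force_single(obs, prior, trans, emit):
    """Same single-state marginals by summing over all M^T state sequences."""
    T, M = len(obs), len(prior)
    totals = [[0.0] * M for _ in range(T)]
    for states in product(range(M), repeat=T):
        p = prior[states[0]] * emit[states[0]][obs[0]]
        for t in range(1, T):
            p *= trans[states[t - 1]][states[t]] * emit[states[t]][obs[t]]
        for t in range(T):
            totals[t][states[t]] += p
    evidence = sum(totals[0])
    return [[v / evidence for v in row] for row in totals]

# Hypothetical toy parameters: 2 hidden states, 2 observation symbols.
prior = [0.6, 0.4]
trans = [[0.7, 0.3], [0.2, 0.8]]
emit = [[0.9, 0.1], [0.3, 0.7]]
obs = [0, 1, 1, 0]

single, pairwise = forward_backward(obs, prior, trans, emit)
reference = brute_force_single(obs, prior, trans, emit)
```

Here `pairwise[t - 1][i][j]` holds the posterior probability of state `i` at step `t - 1` followed by state `j` at step `t`, so summing it over `i` recovers `single[t][j]`.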

3. Conditional Learning 

While generative learning seeks to estimate a probability distribution 
over all the variables in the system, it is possible to be more efficient if 
the task we are trying to solve is made explicit. If we know precisely 
what conditional distributions will be used, it is more appropriate to 
directly optimize the conditional distributions instead of the generative 
model as a whole. Since we often use our probability models almost ex- 
clusively to compute conditionals over the outputs (response variables) 
given the inputs (covariates), we can directly optimize parameters and 
fit the model to data such that this task is done optimally. This is 
not quite discriminative learning since we are still fitting a probability 
density in the output distribution and we have a generative model of
the outputs given the inputs, i.e. $P(Y|X)$. In a purist discriminative
setting, we would only consider the final estimate $\hat{Y}$ and extract that
from the distribution in a winner-take-all type of scenario. We view
conditional learning as an intermediate between discriminative and
generative learning. We are still optimizing a probability distribution, but
only the one that we will ultimately use for classification or regression
purposes. In the spirit of minimalism, we do away with the need to learn
the joint generative model, $P(X, Y)$, and focus only on the conditional
distribution $P(Y|X)$.

3.1 Conditional Bayesian Inference 

One can obtain a conditional density from the unconditional (i.e.
joint) probability density function in Equation 2.1 and Equation 2.2,
yet this is roundabout and can be shown to be suboptimal. However, it
has remained popular and is convenient partly because of the availability
of powerful techniques for joint density estimation (such as EM and
variational Bayes). If we know a priori that we will need the conditional
density, it is clear that it should be estimated directly from the
training data. Direct Bayesian conditional density estimation is defined in
Equation 2.3. The vector $X$ (the input or covariate) is always given and
the $Y$ (the output or response) is to be estimated. The training data is
explicitly split into the corresponding $\mathcal{X}$ and $\mathcal{Y}$ vector
sets. Note here that the conditional density is referred to as $P(Y|X)^c$
to distinguish it from the expression in Equation 2.2.

$$
\begin{aligned}
P(Y|X)^c = P(Y|X, \mathcal{X}, \mathcal{Y}) &= \int P(Y, \Theta^c | X, \mathcal{X}, \mathcal{Y}) \, d\Theta^c \\
&= \int P(Y | X, \Theta^c, \mathcal{X}, \mathcal{Y}) P(\Theta^c | X, \mathcal{X}, \mathcal{Y}) \, d\Theta^c \\
&= \int P(Y | X, \Theta^c) P(\Theta^c | \mathcal{X}, \mathcal{Y}) \, d\Theta^c
\end{aligned}
\qquad (2.3)
$$

Here, $\Theta^c$ parameterizes a conditional density $P(Y|X)$. $\Theta^c$ is exactly
the parameterization of the conditional density $P(Y|X)$ that results from
the joint density $P(X, Y)$ parameterized by $\Theta$. Initially, it seems
intuitive that the above expression should yield exactly the same conditional
density as before. It seems natural that $P(Y|X)^c$ should equal $P(Y|X)^j$
since $\Theta^c$ is just the conditioned version of $\Theta$. In other words, if the
expression in Equation 2.1 is conditioned as in Equation 2.2, then the
result in Equation 2.3 should be identical. This conjecture is wrong.

Upon closer examination, we note an important difference. The $\Theta^c$ we
are integrating over in Equation 2.3 is not the same $\Theta$ as in Equation 2.1.
In the direct conditional density estimate (Equation 2.3), the $\Theta^c$ only
parameterizes a conditional density $P(Y|X)$ and therefore provides no
information about the density of $X$ or $\mathcal{X}$. In fact, we can assume that
the conditional density parameterized by $\Theta^c$ is just a function over $X$
with some parameters. Therefore, we can essentially ignore any relationship
it could have to some underlying joint density parameterized by $\Theta$.
Since this is only a conditional model, the term $P(\Theta^c|\mathcal{X}, \mathcal{Y})$
in Equation 2.3 behaves differently than the similar term
$P(\Theta|\mathcal{Z}) = P(\Theta|\mathcal{X}, \mathcal{Y})$
in Equation 2.1. This is illustrated in the manipulation involving Bayes
rule shown below:

$$
\begin{aligned}
P(\Theta^c|\mathcal{X}, \mathcal{Y}) &= \frac{P(\mathcal{Y}|\Theta^c, \mathcal{X}) P(\Theta^c, \mathcal{X})}{P(\mathcal{X}, \mathcal{Y})} = \frac{P(\mathcal{Y}|\Theta^c, \mathcal{X}) P(\mathcal{X}|\Theta^c) P(\Theta^c)}{P(\mathcal{X}, \mathcal{Y})} \\
&= \frac{P(\mathcal{Y}|\Theta^c, \mathcal{X}) P(\mathcal{X}) P(\Theta^c)}{P(\mathcal{X}, \mathcal{Y})}
\end{aligned}
$$



In the final line of the above equation, an important manipulation is
noted: $P(\mathcal{X}|\Theta^c)$ is replaced with $P(\mathcal{X})$. This implies that
observing $\Theta^c$ does not affect the probability of $\mathcal{X}$. This operation
is invalid in the joint density estimation case since $\Theta$ has parameters
that determine a density in the $X$ domain. However, in conditional density
estimation, if $\mathcal{Y}$ is not also observed, $\Theta^c$ is independent from
$\mathcal{X}$. It in no way constrains or provides information about the density
of $X$ since it merely parameterizes the conditional density $P(Y|X)$. This
independence property does not always hold. However, here we are strictly
assuming that the parameterization $\Theta^c$ is such that there is only a
conditional functional dependence between the parameters and the input
variables (i.e. no marginal distribution over $X$ should be induced from
$\Theta^c$). The graphical models in Figure 2.5 depict the difference between
joint density models and conditional density models using a directed
acyclic graph [111, 92]. Note that the $\Theta^c$ model and the $\mathcal{X}$ are
independent if $\mathcal{Y}$ is not observed in the conditional density estimation
scenario. In graphical modeling terms, the $\Theta$ joint parameterization is
a parent of the children nodes $\mathcal{X}$ and $\mathcal{Y}$. Meanwhile, the
conditional parameterization $\Theta^c$ and the $\mathcal{X}$ data are co-parents of
the child $\mathcal{Y}$ (they are marginally independent). Equation 2.4 then
finally illustrates the directly estimated conditional density solution
$P(Y|X)^c$.





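Figure 2.5. Graphical Models of Joint and Conditional Density Estimation. Recovered panel title: (b) Conditional Density Estimation.

$$
\begin{aligned}
P(Y|X)^c &= \int P(Y|X, \Theta^c) P(\Theta^c|\mathcal{X}, \mathcal{Y}) \, d\Theta^c \\
&= \int P(Y|X, \Theta^c) \frac{P(\mathcal{Y}|\Theta^c, \mathcal{X}) P(\mathcal{X}) P(\Theta^c)}{P(\mathcal{X}, \mathcal{Y})} \, d\Theta^c \\
&= \frac{P(\mathcal{X})}{P(\mathcal{X}, \mathcal{Y})} \int P(Y|X, \Theta^c) P(\mathcal{Y}|\Theta^c, \mathcal{X}) P(\Theta^c) \, d\Theta^c
\end{aligned}
\qquad (2.4)
$$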

If a conditional density is required, it appears superior to perform
conditional Bayesian inference rather than to perform joint Bayesian
inference and subsequently condition the answer. This is illustrated with
an example below.

Joint versus Conditional Bayesian Inference In the following, we present
a specific example to demonstrate the difference between the superior
conditional estimate $P(Y|X)^c$ versus the conditioned joint estimate
$P(Y|X)^j$ (more details are in the Appendix of [83]). We demonstrate this
with a simple 2-component 2D Gaussian mixture model with identity covariance
and equal mixing proportions as shown in Figure 2.6(a). The likelihood for a
data point $Z = (X, Y)$ is:
$P(Z|\Theta) = \frac{1}{2} \mathcal{N}(Z|\mu) + \frac{1}{2} \mathcal{N}(Z|\nu)$.
The prior $P(\Theta)$ over the parameters (i.e. the two means
$\Theta = \{\mu, \nu\}$) is a wide zero-mean, spherical Gaussian distribution
(with very large covariance $\sigma^2$). We can infer the standard joint
Bayesian distribution from a total of $T$ training data points by:

$$
\begin{aligned}
P(X, Y) &\propto \int P(X, Y|\Theta) P(\mathcal{X}, \mathcal{Y}|\Theta) P(\Theta) \, d\Theta \\
&\propto \int P(X, Y|\Theta) \prod_{i=1}^{T} P(X_i, Y_i|\Theta) P(\Theta) \, d\Theta \\
&\propto \int P(X, Y|\Theta) \prod_{i=1}^{T} \left( \tfrac{1}{2} \mathcal{N}(Z_i|\mu) + \tfrac{1}{2} \mathcal{N}(Z_i|\nu) \right) P(\Theta) \, d\Theta.
\end{aligned}
$$

Figure 2.6. Conditioned Bayesian versus Conditional Bayesian Inference.
The above equation can be solved exactly if we expand the products over
the two terms in the mixture model. Unfortunately, the number of terms
grows exponentially fast as $2^T$ (from all possible assignments of the data
points to the two Gaussians) but can be computed for small data sets. For
the 4-point data set in Figure 2.6(b), we compute the joint Bayesian
inference and plot $P(X, Y)$ as shown in Figure 2.6(c). Conditioning this
$P(X, Y)$ on $X$ gives us the conditional $P(Y|X)^j$ (the superscript $j$
shows that this conditional came from the joint Bayesian inference). The
function $P(Y|X)^j$ is plotted in Figure 2.6(d) for the value $X = -5$. We
then proceed to compute $P(Y|X)^c$ directly using another integration, the
conditional Bayesian inference, as follows:

$$
\begin{aligned}
P(Y|X)^c &\propto \int P(Y|X, \Theta) P(\mathcal{Y}|\Theta, \mathcal{X}) P(\Theta) \, d\Theta \\
&\propto \int P(Y|X, \Theta) \prod_{i=1}^{T} P(Y_i|X_i, \Theta) P(\Theta) \, d\Theta \\
&\propto \int \frac{P(X, Y|\Theta)}{\int_Y P(X, Y|\Theta) \, dY} \prod_{i=1}^{T} \frac{P(X_i, Y_i|\Theta)}{\int_Y P(X_i, Y|\Theta) \, dY} P(\Theta) \, d\Theta.
\end{aligned}
$$

The resulting function $P(Y|X)^c$ is different from $P(Y|X)^j$ and is plotted
in Figure 2.6(e). Note how conditional Bayesian inference captures the
bimodality of the data which was lost with the inferior regular Bayesian
inference.



3.2 Maximum Conditional Likelihood 

As in Bayesian inference, integration in conditional Bayesian inference
(Equation 2.4) is typically intractable to evaluate in closed form. To
approximate the average over many models, we will often simply pick
one model at the mode of the integral. This results in the corresponding
maximum conditional a posteriori (MAP$^c$) and maximum conditional
likelihood (ML$^c$) solutions as in:

$$
P(Y|X)^c \approx P(Y|X, \Theta^*) \quad \text{where} \quad
\Theta^* = \begin{cases}
\arg\max_{\Theta^c} P(\mathcal{Y}|\Theta^c, \mathcal{X}) P(\Theta^c) & \text{MAP}^c \\
\arg\max_{\Theta^c} P(\mathcal{Y}|\Theta^c, \mathcal{X}) & \text{ML}^c.
\end{cases}
$$



The a posteriori solution allows the use of a prior to regularize the
estimate while the conditional likelihood approach merely optimizes the
model on the training data alone which may cause overfitting. We
typically find the maximum of the logarithm of the above quantities to
obtain the best model. We expand the above for the maximum conditional
likelihood case as follows:

$$
\begin{aligned}
\log P(\mathcal{Y}|\Theta^c, \mathcal{X}) &= \sum_t \log P(Y_t|X_t, \Theta^c) = \sum_t \log \frac{P(Y_t, X_t|\Theta^c)}{P(X_t|\Theta^c)} \\
&= \sum_t \log P(Y_t, X_t|\Theta^c) - \sum_t \log \int_Y P(Y, X_t|\Theta^c) \, dY.
\end{aligned}
$$
The above optimization of conditional likelihood is very similar to the
one for maximum likelihood except for the extra negative term which 
is often referred to as the background probability since it is a marginal 
over the input distribution. Thus, we are trying to maximize the joint 
likelihood of input and output while minimizing the marginal likelihood 
over the input data. This sets up an interesting metaphor where a 
class conditional model is attracted to data it should fit through the 
joint likelihood but repelled by the background or data that does not 
belong to the model’s class. Many other criteria are actually conditional 
likelihood in disguise which sometimes causes confusion. For example, 
in the speech recognition literature, conditional maximum likelihood is 
referred to as maximum mutual information [152]. Currently, hidden
Markov models in the speech community are being trained with these 
conditional criteria [152, 200] to obtain state of the art performance on 
large corpus data sets. 
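The split of the conditional log-likelihood into a joint term minus a background term can be checked numerically on a small example. The sketch below assumes a two-class model with unit-variance one-dimensional Gaussian class-conditional densities and discrete outputs, so the integral over $Y$ becomes a sum; the data set, the candidate means and all function names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var=1.0):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def joint(y, x, means, priors):
    """P(Y = y, X = x | Theta) for a class-conditional Gaussian model."""
    return priors[y] * gaussian_pdf(x, means[y])

def conditional_log_likelihood(data, means, priors):
    """Joint term minus background term; outputs are discrete, so we sum over Y."""
    joint_term = sum(math.log(joint(y, x, means, priors)) for x, y in data)
    background_term = sum(
        math.log(sum(joint(k, x, means, priors) for k in range(len(means))))
        for x, _ in data)
    return joint_term - background_term, joint_term, background_term

# Hypothetical 1D data set of (input, class label) pairs and two candidate models.
data = [(-2.1, 0), (-1.7, 0), (-0.3, 0), (0.4, 1), (1.9, 1), (2.2, 1)]
priors = [0.5, 0.5]
ml_means = [sum(x for x, y in data if y == k) / sum(1 for _, y in data if y == k)
            for k in range(2)]
spread_means = [-3.0, 3.0]  # pushes each class mean away from the other class

cl_ml, joint_ml, _ = conditional_log_likelihood(data, ml_means, priors)
cl_spread, joint_spread, _ = conditional_log_likelihood(data, spread_means, priors)
```

On this toy data the class means fitted by maximum likelihood give the higher joint term, while the spread-out means give the higher conditional likelihood, which is the attraction and repulsion effect described above.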

Unlike maximum likelihood which has been able to handle incom- 
plete and latent models for years with the EM algorithm, conditional 
likelihood has been traditionally difficult to maximize, especially in a 
mixture model scenario. In fact, maximizing conditional likelihood in 
non-latent models will still give rise to computational difficulties since 
the background probability involves a log of a sum or a log of an inte- 
gral over the outputs (classes or scalars) which might break concavity. 
Most approaches have to resort to gradient descent [16, 98]. Other vari- 
ants of gradient descent and line search are also emerging in statistics 
[46]. Recently, the conditional version of the EM algorithm has been 
proposed in the CEM algorithm [88, 89]. As in EM, conditional expec- 
tation maximization (CEM) iterates between bounding the conditional 
likelihood and solving the resulting simpler complete data maximization. 
This converges iteratively and monotonically to a maximum conditional 
likelihood solution. As in variational Bayes, a similar use of the CEM 
bounds on conditional posteriors and conditional likelihoods prior to 
integration can result in a tractable approximation to the conditional 
Bayesian inference. This would provide a generative model that is more 
optimized for the task at hand while still relying on a fully Bayesian 
formalism. 

3.3 Logistic Regression 

Maximum conditional likelihood (and in a sense maximum likelihood) 
is also very closely related to logistic regression, a popular technique in 
the statistics community [72, 119, 64]. Logistic regression is a conditional
distribution of a binary output variable $y$ given an input vector $X$.
Typically, the conditional model is given by the following formula (where
$\theta$ is a parameter vector): $P(y = 1|X) = 1/(1 + \exp(-\theta^T X))$. This
generates a linear classifier which is varied by the parameter vector $\theta$ and
is also referred to as a generalized linear model. There are various ways
to augment the framework by computing higher order features from a 
given X vector. These include handling discrete X values by considering 
indicator features as in [39, 106]. 
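A minimal sketch of fitting this model by maximizing the conditional log-likelihood is given below. It uses plain batch gradient ascent with a fixed step size, which is one simple choice among many; the data, the step size, the iteration count and the function names are assumptions made for the example.

```python
import math

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def conditional_log_likelihood(theta, inputs, labels):
    total = 0.0
    for x, y in zip(inputs, labels):
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        total += math.log(p if y == 1 else 1.0 - p)
    return total

def fit_logistic(inputs, labels, step=0.1, iterations=500):
    """Gradient ascent on sum_t log P(y_t | X_t, theta)."""
    theta = [0.0] * len(inputs[0])
    for _ in range(iterations):
        gradient = [0.0] * len(theta)
        for x, y in zip(inputs, labels):
            error = y - sigmoid(sum(t * xi for t, xi in zip(theta, x)))
            for d, xi in enumerate(x):
                gradient[d] += error * xi
        theta = [t + step * g for t, g in zip(theta, gradient)]
    return theta

# Hypothetical data; the trailing 1.0 in each input is a constant bias feature.
inputs = [[-2.0, 1.0], [-1.0, 1.0], [-0.5, 1.0], [0.5, 1.0],
          [1.0, 1.0], [2.5, 1.0], [0.2, 1.0], [-0.1, 1.0]]
labels = [0, 0, 0, 1, 1, 1, 0, 1]
theta = fit_logistic(inputs, labels)
```

The two classes overlap in this data set, so the maximizer is finite and the fixed step size is small enough for the ascent to be stable.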

4. Discriminative Learning 

Discriminative learning goes beyond the conditional learning perspective
and is even more minimalist. Here, only the final mapping from an
input ($X$) to output ($Y$) is important and only the final estimate
$\hat{Y}$ that will be produced is considered [158]. Even the estimation of a
conditional distribution $P(Y|X)$ is viewed as an unnecessary intermediate
step just as we previously argued that the estimation of a joint
distribution $P(X, Y)$ may be inefficient. It should be noted that this
distinction between conditional learning and discriminative learning is not necessarily
a well established convention in the field. Alternatively, we may con- 
sider other quantities resulting from the classifier, for example margin 
distances from the decision boundary to the nearest exemplars. Thus, 
discriminative techniques only consider the decision boundary or the 
regression function approximation in evaluating the parameters for a 
model. Since our learning algorithm is closely matched with the final 
task of the system, discriminative learning techniques will not squander 
resources on an intermediate goal like generative modeling. The resulting
performance of the classifier or regressor is directly evaluated and
improved during learning. Since we can no longer consider a
distribution-based criterion, Bayesian methods, priors and likelihood-based
learning techniques are not immediately applicable.

4.1 Empirical Risk Minimization 

As opposed to the previous sections where we started with the averag- 
ing based solutions (Bayesian integration) and moved to more empirical 
or local approximations (maximum likelihood), we begin here with an 
empirical approach to optimizing a discriminative classifier or regres- 
sor and show averaging and regularization subsequently. Empirical risk 
minimization (ERM) is a discriminative estimation criterion which does 
not make assumptions about the distribution of the input or the output 
[122, 28, 187, 186, 188]. In ERM, we are typically given a loss function
of the form $l(X_t, y_t, \Theta)$ which measures the penalty incurred for a data
point (where $X_t$ is the input and $y_t$ is the desired output) when assigning
the parameter $\Theta$ to our model. If we only concern ourselves with an
empirical local solution, we will minimize this loss function on the training
data set (which has a total of $T$ training data points). This average loss
is also called the empirical risk:

$$R_{emp}(\Theta) = \frac{1}{T} \sum_{t=1}^{T} l(X_t, y_t, \Theta).$$

This is meant to be a coarse approximation of the true loss of a classifier
on the unknown distribution of the samples which is also known as the
expected risk:

$$R(\Theta) = \int_{\mathcal{X} \times \mathcal{Y}} P(X, y) \, l(X, y, \Theta) \, dX \, dy.$$

In the limit of infinite data, the above risks will become almost
equal for a given $\Theta$ value. Here, $\Theta$ specifies a mapping which will
produce an estimated $\hat{y}_t$ from the input $X_t$. The loss function
measures the level of disagreement between $y_t$ and $\hat{y}_t$. Possible choices
are quadratic loss such as $\|y_t - \hat{y}_t\|^2$ or binary winner-take-all loss that is
more appropriate for classification. Although one can often reinterpret 
loss functions as likelihoods under a specific choice of conditional output 
distributions, this may be an awkward process. The important aspect 
is ERM’s emphasis on the actual output and the resulting deterministic 
classification boundary that will be formed. For example, we may choose 
to compute the classification result in a winner take all sense in which 
case the loss will be 0 if our learning machine predicts a binary class 
label appropriately and the loss will be 1 if it fails. This type of hard 
classification is fundamental to discriminative estimation. It would be 
awkward to represent as a conditional distribution (or logistic regressor) 
but one possibility might be a very sharp version of the logistic function, 
in other words: 

$$P(y = 1|X) = \begin{cases} 1 & \text{if } \theta^T X \ge 0 \\ 0 & \text{if } \theta^T X < 0. \end{cases} \qquad (2.5)$$
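The empirical risk under a hard winner-take-all decision can be sketched in a few lines. The linear decision rule, the toy data and the function names below are hypothetical and serve only to show how the average loss is formed.

```python
def predict(theta, x):
    """Hard winner-take-all decision of a linear classifier: 1 if theta^T x >= 0 else 0."""
    return 1 if sum(t * xi for t, xi in zip(theta, x)) >= 0 else 0

def zero_one_loss(x, y, theta):
    return 0.0 if predict(theta, x) == y else 1.0

def quadratic_loss(x, y, theta):
    return float((y - predict(theta, x)) ** 2)

def empirical_risk(loss, inputs, labels, theta):
    """Average loss over the T training points."""
    return sum(loss(x, y, theta) for x, y in zip(inputs, labels)) / len(inputs)

# Hypothetical training set; the trailing 1.0 is a constant bias feature.
inputs = [[-2.0, 1.0], [-1.0, 1.0], [0.5, 1.0], [1.5, 1.0], [-0.2, 1.0]]
labels = [0, 0, 1, 1, 1]
theta = [1.0, 0.0]

risk = empirical_risk(zero_one_loss, inputs, labels, theta)
```

With binary labels and hard predictions the quadratic loss coincides with the 0/1 loss, so both give the same empirical risk on this data.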

4.2 Structural Risk Minimization 

Since ERM is only locally attempting to optimize the model to the 
training data, it does not necessarily coincide with the true expected 
risk and may not exhibit good generalization behavior on future data. 
An alternative is to consider augmenting the local solution with a prior 
or regularizer that favors estimates that are more likely to agree with 
future data and are based on a measure of the model capacity. This form 
of regularized ERM has been called structural risk minimization (SRM) 
which penalizes excessively flexible classifiers by measuring their capacity
through the so-called Vapnik-Chervonenkis dimension [187, 186, 188,
28, 101]. SRM may not perform as well on the training data as ERM 
but should generalize better to future data. 

An interesting result due to [187, 186, 188, 28] is that the risk or
expected loss (for samples outside of the training set) is boundable from
above by the empirical risk plus a term that depends only on the size
of the training set, $T$, and the VC-dimension, $h$, of the classifier. This
non-negative integer quantity $h$ measures the capacity of a classifier and
is independent of the distribution of the data. The following bound on
the expected loss $R(\Theta)$ in terms of the empirical risk $R_{emp}(\Theta)$
holds with probability $1 - \delta$ and is given here without proof:

$$R(\Theta) \le R_{emp}(\Theta) + \sqrt{\frac{h(\log(2T/h) + 1) - \log(\delta/4)}{T}}.$$

The SRM principle suggests that we minimize the upper bound on $R(\Theta)$
to minimize the expected risk itself. Thus we will minimize a combination
of the empirical risk and the VC-dimension $h$. For binary linear
classifiers, this motivates the use of a classifier that separates two class 
data with large margins. Large margins mean that we also try to max- 
imize the minimum distance from the points to the decision boundary 
that separates them. These principles give rise to the support vector 
machine (SVM). SVMs are particularly important in contemporary ma- 
chine learning since they have recently provided state of the art clas- 
sification and regression performance. In many senses, these are the 
current workhorses of discriminative estimation. However, due to their 
fundamentally discriminative formalism, they don’t enjoy the flexibil- 
ity of generative modeling (priors, invariants, structure, latent variables, 
etc.) which limits their applicability. We next give a quick overview of 
VC and support vector machines (an in-depth treatment of the topics 
can be found in [165, 37, 70, 28, 187, 65]). 
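To get a feel for the size of the capacity term in the bound above, it can be evaluated directly. The sketch below assumes natural logarithms; the function names and the numerical settings are hypothetical.

```python
import math

def vc_confidence(h, T, delta):
    """Capacity term sqrt((h (log(2T/h) + 1) - log(delta/4)) / T) of the bound."""
    return math.sqrt((h * (math.log(2.0 * T / h) + 1.0) - math.log(delta / 4.0)) / T)

def risk_bound(empirical_risk, h, T, delta=0.05):
    """Upper bound on the expected risk that holds with probability 1 - delta."""
    return empirical_risk + vc_confidence(h, T, delta)

# Hypothetical setting: T = 10000 training points, 5% empirical error.
bound_small_h = risk_bound(0.05, h=10, T=10000)
bound_large_h = risk_bound(0.05, h=1000, T=10000)
```

For a fixed empirical risk the bound tightens as the VC-dimension shrinks or as the number of training points grows, which is the trade-off SRM exploits.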

4.3 VC Dimension and Large Margins 

The VC dimension of a set of functions measures its complexity and is
imputed into the above formula to give a bound on generalization error.
Assume we are dealing with functions that map the input space to a
binary value. We will search for a good function within this set or
hypothesis class to build our classifier. The VC-dimension is a
distribution-free quantity in that it measures complexity only on the basis
of our chosen hypothesis class or the space of functions we are exploring.
It does not depend on the actual training data or the complexity of the
dataset we are dealing with. Given the set of functions, its VC-dimension
$h$ is equal to the largest number of points in the input space that the set
can shatter. A set of functions can shatter $h$ points if there exist $h$
points $X_1, \ldots, X_h$ in our input space such that for every possible
labeling $Y_1, \ldots, Y_h$ of the $h$ points, there exists at least one
function in our set that will get zero training error and perfectly
classify all $h$ points. For example, consider the space of linear
classifiers acting on 2D data in Figure 2.7. Here, we can see the
VC-dimension of a linear classifier in 2D is $h = 3$ since we can perfectly
classify the 3 points with a linear decision boundary no matter what
labeling we choose for them, while no set of 4 points can be shattered. In
higher dimensions, it is straightforward to show that linear classifiers or
hyperplanes in $D$-dimensional space have a VC dimension $h = D + 1$ and
will be able to shatter $D + 1$ points.

Figure 2.7. Shattering and the VC Dimension of Linear Classifiers.
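Shattering can be checked by brute force on small point sets. The sketch below tries every labeling of a set of 2D points and tests linear separability with the perceptron rule; stopping after a fixed number of epochs is a heuristic stand-in for "not separable" and, like the point sets and function names, is an assumption made for illustration.

```python
from itertools import product

def linearly_separable(points, labels, max_epochs=1000):
    """Perceptron test: returns True once a separating line w^T x + b is found.

    The perceptron is guaranteed to converge on separable data; giving up after
    max_epochs is a heuristic stand-in for 'not separable'.
    """
    w, b = [0.0] * len(points[0]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False

def can_shatter(points):
    """True if every one of the 2^h labelings of the points is linearly separable."""
    return all(linearly_separable(points, labels)
               for labels in product((-1, +1), repeat=len(points)))

three_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
four_points = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]

shatters_three = can_shatter(three_points)
shatters_four = can_shatter(four_points)
```

The three points in general position are shattered, whereas the four corners of the unit square fail on the labeling that puts opposite corners in the same class.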



The above generalization bound tells us that we can expect better 
generalization when VC dimension is lower and we use a more restricted 
set of functions or classifiers when learning. Yet our computation for h 
only tells us that we would prefer a lower dimensional linear classifier for 
better generalization. How does reducing VC dimension relate to and 
encourage us to build a large-margin linear classifier? Let us consider 
a subtle variation on linear classifiers called a gap-tolerant classifier as 
depicted in Figure 2.8. The classifier here consists of two parallel
hyperplanes separated by a margin $m$ whose classification decisions are
constrained to a sphere in $\mathbb{R}^D$ with a diameter $d$. When using this
gap-tolerant classifier, we only count classification errors that are within
the sphere and not between the two parallel hyperplanes. In Figure 2.8(a)
we can see that we can still shatter 3 points in $\mathbb{R}^2$ if the margin is
small enough. However, if we increase the margin $m$ as in Figure 2.8(b),
we reach a point where we can only shatter 2 points in $\mathbb{R}^2$ and therefore
the VC dimension is lower. For such a set of functions or gap-tolerant
classifiers in $D$ dimensions, we can bound the VC dimension via the
following inequality:

$$h \le \min\left\{ \left\lceil \frac{d^2}{m^2} \right\rceil, D \right\} + 1.$$

We might explore minimizing the diameter of the sphere of the gap-tolerant
classifiers yet typically this is a constant with our dataset. Instead, we
will seek to increase the margin $m$ between the two hyperplanes to reduce
the VC dimension and improve our potential generalization error.
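The effect of the margin on this bound is easy to tabulate. The following sketch assumes the bound takes the form $\min\{\lceil d^2/m^2 \rceil, D\} + 1$; the function name and the numerical values are hypothetical.

```python
import math

def gap_tolerant_vc_bound(diameter, margin, dimension):
    """Upper bound min(ceil(d^2 / m^2), D) + 1 on the VC dimension."""
    return min(math.ceil(diameter ** 2 / margin ** 2), dimension) + 1

# Hypothetical values: points confined to a sphere of diameter 2 in the plane.
small_margin_bound = gap_tolerant_vc_bound(diameter=2.0, margin=0.5, dimension=2)
large_margin_bound = gap_tolerant_vc_bound(diameter=2.0, margin=2.0, dimension=2)
```

With a small margin the bound is set by the input dimension, whereas a margin as wide as the sphere itself brings it down to 2.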



Figure 2.8. The VC Dimension of Gap Tolerant Classifiers. Recovered panel titles: (a) Gap Tolerant Classifier; (c) Larger Margin; (d) For Large Margin h = 2.



4.4 Support Vector Machines 

The gap-tolerant classifier can be defined by equations governing its
two parallel hyperplanes which we call margin-hyperplanes and denote
by $H_+$ and $H_-$ as in Figure 2.9(a). The hyperplane half way between
these two gives our decision boundary and is denoted $H$. All these
hyperplanes are given by the following formulae:

$$
\begin{aligned}
H &\rightarrow w^T X + b = 0 \\
H_+ &\rightarrow w^T X + b = +1 \\
H_- &\rightarrow w^T X + b = -1.
\end{aligned}
$$

Here we have chosen the arbitrary value $\pm 1$ to define the
margin-hyperplanes yet this is without loss of generality since any scaling
of $w$ and $b$ will yield the same $H$ linear decision boundary. The distance
of $H$ to the origin will be denoted $q$ and is computed by:

$$q = \min_X \|X\| \quad \text{subject to} \quad w^T X + b = 0.$$

To find the $X$ on the hyperplane that is closest to the origin, we can
instead solve the following Lagrangian optimization
$\min_X \frac{1}{2}\|X\|^2 - \lambda(w^T X + b)$.
Setting the Lagrangian’s derivative over $X$ to 0 yields the solution
$X = \lambda w$. Combining that with the constraint $w^T X + b = 0$ gives us
the formula $X = -\frac{b}{\|w\|^2} w$. Therefore we have the distance to the
origin as $q = \|X\| = |b| / \|w\|$. We can similarly compute the distance
$q_+$ of $H_+$ to the origin and the distance $q_-$ of $H_-$ to the origin.
The margin is then the distance from $H_-$ to $H_+$ and, when the origin
does not lie between the two margin-hyperplanes, is given by:

$$m = |q_+ - q_-| = \left| \frac{|b - 1|}{\|w\|} - \frac{|b + 1|}{\|w\|} \right| = \frac{2}{\|w\|}.$$
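The distances above are simple to verify numerically. The sketch below uses a hypothetical hyperplane whose margin-hyperplanes both lie on the same side of the origin, so that the difference of the two distances equals the margin; all names and values are illustrative.

```python
import math

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def closest_point_to_origin(w, b):
    """Point X = -(b / ||w||^2) w on the hyperplane w^T X + b = 0."""
    scale = -b / sum(wi * wi for wi in w)
    return [scale * wi for wi in w]

def distance_to_origin(w, b, offset=0.0):
    """Distance from the origin to the hyperplane w^T X + b = offset."""
    return abs(b - offset) / norm(w)

def margin(w):
    """Distance between the margin-hyperplanes w^T X + b = +1 and w^T X + b = -1."""
    return 2.0 / norm(w)

# Hypothetical hyperplane parameters.
w, b = [3.0, 4.0], -10.0
x_closest = closest_point_to_origin(w, b)
q = distance_to_origin(w, b)
q_plus = distance_to_origin(w, b, offset=+1.0)
q_minus = distance_to_origin(w, b, offset=-1.0)
```

Here the closest point lies on the decision boundary at distance `q` from the origin, and `abs(q_plus - q_minus)` agrees with `margin(w)`.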






Figure 2.9. Linear and Nonlinear Support Vector Machines. Recovered panel titles: (b) Polynomial Kernel SVM; (c) RBF Kernel SVM.

To minimize the empirical error on a training dataset with a gap tolerant
classifier, the hyperplanes $H_-$ and $H_+$ must separate the data
appropriately by ensuring that all positive points are on the correct side
of $H_+$ and all negative points are on the correct side of $H_-$. This
requires the following linear constraints for all our data $t \in 1..T$:

$$
\begin{aligned}
w^T X_t + b &\ge +1 \quad \text{if } y_t = +1 \\
w^T X_t + b &\le -1 \quad \text{if } y_t = -1.
\end{aligned}
$$

While we ensure these zero-training error classification constraints are
satisfied, we also attempt to reduce the generalization penalty term by
lowering VC dimension. So we also maximize the margin, or,
equivalently, minimize $\frac{1}{2}\|w\|^2$. The above linear classification
constraints and this quadratic cost function on $w$ actually form a
quadratic program on $(w, b)$ which can then be solved easily with standard
quadratic programming software. We recover the parameters of the linear
decision boundary for $H$ which is effectively our large-margin support
vector machine classifier. Finally, given our unique globally optimal
solution for $w$ and $b$ (potentially via the output of a quadratic
program), we have our support vector machine classifier which can predict
the label $\hat{y}$ of a new input point $X$ as follows:

$$\hat{y} = \text{sign}(w^T X + b).$$
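The ingredients of this quadratic program can be sketched without a solver: a feasibility check for the margin constraints, the quadratic cost, and the sign prediction. The training set and the two candidate classifiers below are hypothetical; they are chosen so that both are feasible and one has the larger margin.

```python
def decision_value(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def satisfies_margin_constraints(w, b, inputs, labels):
    """True if y_t (w^T X_t + b) >= 1 for every training point."""
    return all(y * decision_value(w, b, x) >= 1.0 for x, y in zip(inputs, labels))

def primal_cost(w):
    """Quadratic cost (1/2) ||w||^2 that the quadratic program minimizes."""
    return 0.5 * sum(wi * wi for wi in w)

def predict(w, b, x):
    """Predicted label sign(w^T X + b), mapping a zero decision value to +1."""
    return 1 if decision_value(w, b, x) >= 0 else -1

# Hypothetical separable training set and two feasible candidate classifiers.
inputs = [(1.0, 1.0), (2.0, 0.0), (-1.0, -1.0), (0.0, -2.0)]
labels = [+1, +1, -1, -1]
w_tight, b_tight = [0.5, 0.5], 0.0   # margin-hyperplanes touch the data
w_loose, b_loose = [1.0, 1.0], 0.0   # also feasible, but with a smaller margin
```

Both candidates classify the training set with zero error, but `w_tight` has the lower cost and therefore the larger margin, which is what the quadratic program would prefer.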

However, we rarely solve for the support vector machine in this way and
instead use the dual of the above formulation. Consider the following
primal Lagrangian that corresponds to the quadratic program we have
so far (here we combine the linear constraints into a more compact
formula by multiplying by the labels $y_t$):

$$\mathcal{L} = \frac{1}{2}\|w\|^2 - \sum_t \lambda_t \left( y_t (w^T X_t + b) - 1 \right).$$

Since the constraints multiplied by the Lagrange multipliers $\lambda_t$
are all greater-than inequalities, the $\lambda_t$ values must remain
non-negative in $\mathcal{L}$. Taking derivatives of $\mathcal{L}$ with
respect to $w$ and $b$ and setting them to zero gives the following
formula for the linear classifier in terms of the Lagrange multipliers,
$w = \sum_t \lambda_t y_t X_t$, as well as the additional constraint
$\sum_t \lambda_t y_t = 0$. These two formulas can then be reinserted
into the primal Lagrangian $\mathcal{L}$ to obtain the dual Lagrangian
problem $\mathcal{D}$, which now needs to be maximized instead:

$$\mathcal{D} = \sum_t \lambda_t - \frac{1}{2} \sum_{t,t'} \lambda_t \lambda_{t'} y_t y_{t'} X_t^T X_{t'}.$$

We can solve the above for the Lagrange multipliers
$\lambda = \{\lambda_1, \ldots, \lambda_T\}$ again just by using
standard quadratic programming. Note, however, that this quadratic cost
function over $\lambda$ is more elaborate while the constraints are
simpler. In the dual space, we only have the constraints stating that
$\lambda_t \ge 0$ and $\sum_t \lambda_t y_t = 0$. Given the solution for
$\lambda$, we explicitly obtain the linear classifier via the above
formula $w = \sum_t \lambda_t y_t X_t$. Note that there is one Lagrange
multiplier for each data point in our training set. One interesting
property of the support vector machine and this dual solution is that it
gives rise to support vectors. These are points in our training data
that lie on the margin hyperplanes $H_-$ or $H_+$ and have non-zero
Lagrange multipliers. All points lying on the correct side of the margin
region, away from the hyperplanes, have zero Lagrange multipliers. Thus,
we get a sparse solution that is stable relative to these non-support
vector points: it will not change if they are deleted from our training
data set or moved (as long as they do not touch or cross into the margin
region between $H_-$ and $H_+$).

The scalar $b$ still needs to be found to explicitly parameterize $H$,
and this is done via the Karush-Kuhn-Tucker (KKT) conditions. Recall
that the Lagrange multipliers are non-zero only for the support vectors,
which either coincide with $H_-$ for negatively labeled $X_t$ or
coincide with $H_+$ for positively labeled $X_t$. This tells us that for
all $t$ such that $\lambda_t > 0$ we have $y_t (w^T X_t + b) = 1$. This
permits us to solve for $b$ immediately from these points (or to
average the solutions for $b$ obtained from the different support
vectors). If the vectors have zero-valued Lagrange multipliers, their
classification constraints are not active and are subsumed by satisfying
the constraints from support vectors with positive-valued Lagrange
multipliers.
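The recovery of $w$ and $b$ from the dual solution can be sketched as
follows. The two-point dataset and its multipliers are a hand-worked toy
case (for this data the dual reduces to maximizing
$2\lambda - 4\lambda^2$, giving $\lambda_1 = \lambda_2 = 1/4$); the
function names are assumptions of this sketch, and no quadratic
programming solver is invoked.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def recover_w(lams, X, y):
    """w = sum_t lambda_t y_t X_t."""
    dim = len(X[0])
    w = [0.0] * dim
    for lam, xt, yt in zip(lams, X, y):
        for d in range(dim):
            w[d] += lam * yt * xt[d]
    return w

def recover_b(lams, X, y, w, tol=1e-9):
    """Average of b = y_t - w^T X_t over support vectors (lambda_t > 0).

    Follows from y_t (w^T X_t + b) = 1 and y_t in {-1, +1}.
    """
    values = [yt - dot(w, xt) for lam, xt, yt in zip(lams, X, y) if lam > tol]
    return sum(values) / len(values)

def dual_objective(lams, X, y):
    """D = sum_t lambda_t - 1/2 sum_{t,t'} lambda_t lambda_t' y_t y_t' X_t^T X_t'."""
    n = len(X)
    quad = sum(lams[s] * lams[t] * y[s] * y[t] * dot(X[s], X[t])
               for s in range(n) for t in range(n))
    return sum(lams) - 0.5 * quad

X = [[1.0, 1.0], [-1.0, -1.0]]
y = [+1, -1]
lams = [0.25, 0.25]   # dual maximizer for this toy problem, derived by hand

w = recover_w(lams, X, y)
b = recover_b(lams, X, y, w)
print(w, b, dual_objective(lams, X, y), 0.5 * dot(w, w))
```

At the optimum the dual objective and the primal cost
$\frac{1}{2}\|w\|^2$ coincide, which is a convenient sanity check on any
solver output.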

If the classification problem is non-separable, we often introduce slack
variables in the primal quadratic optimization program which permit
some points to be misclassified. Otherwise, the Lagrange multipliers
grow to infinity in the attempt to separate the inseparable labeled
points and end up causing numerical problems. The slack variables give
rise to an interesting change in the dual optimization, which now clamps
the Lagrange multipliers so that they cannot grow beyond a scalar
constant $c$ (while still ensuring they stay non-negative). The value of
$c$ acts as a regularizer that avoids over-fitting the support vector
machine to outliers. The smaller the value of $c$, the less of an effect
a non-separable outlier has on our solution (while a value of
$c = \infty$ tries to separate the data at any price). Note that, in the
non-separable case, the support vectors that are involved in the KKT
conditions for $b$ now have Lagrange multipliers that are not only
strictly positive but also strictly less than $c$. This concludes the
basic framework for linear support vector machines. However, to explore
the full versatility of support vector machines we must go beyond their
purely linear decision boundaries. For nonlinear classification
problems, the SVM can still be used via so-called kernel methods.
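For reference, one standard way of writing the slack-variable program
and its dual is given below. The slack symbols $\xi_t$ and the linear
penalty $c \sum_t \xi_t$ are assumptions of this sketch (the common
soft-margin form); the text above does not commit to a specific form.

```latex
\begin{aligned}
\text{primal:}\quad & \min_{w,b,\xi}\ \tfrac{1}{2}\|w\|^2 + c \sum_t \xi_t
  \quad \text{s.t.}\quad y_t (w^T X_t + b) \ge 1 - \xi_t,\ \ \xi_t \ge 0 \ \ \forall t \\
\text{dual:}\quad & \max_{\lambda}\ \sum_t \lambda_t
  - \tfrac{1}{2}\sum_{t,t'} \lambda_t \lambda_{t'} y_t y_{t'} X_t^T X_{t'}
  \quad \text{s.t.}\quad 0 \le \lambda_t \le c,\ \ \sum_t \lambda_t y_t = 0
\end{aligned}
```

The only change in the dual relative to the separable case is the upper
bound $\lambda_t \le c$, which is the clamping described above.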

4.5 Kernel Methods 

Generalizing the above support vector machine to the nonlinear case
is actually quite simple: we map the data from the original
$\mathbb{R}^D$ Euclidean space to a typically higher or infinite
dimensional space called Hilbert space $\mathcal{H}$ (see [165] for a
detailed exposition). What was only nonlinearly separable in the
original space might be straightforward to separate with a hyperplane in
Hilbert space. This mapping process is illustrated in Figure 1.3. The
mapping is done via a function $\Phi$ which maps vectors
$X \in \mathbb{R}^D$ to a point $\Phi(X)$ in Hilbert space
$\mathcal{H}$. Of course, we need to specify this mapping (either
explicitly or implicitly), which determines what will be separable in
Hilbert space and what nonlinearities we will be able to handle in the
original input space. For example, we may take $\Phi(X)$ to be the
vector containing $X$ concatenated with the vectorized form of its
outer product with itself, i.e. $\Phi(X) = [X; \mathrm{vec}(X X^T)]$.
This permits us to consider quadratic decision boundaries in the
original space $X \in \mathbb{R}^D$. Thus, we replace our data $X_t$
with $\Phi(X_t)$ in the above formulation to handle classification in
Hilbert space and deal with nonlinearities. In fact, we need not
restrict ourselves to Euclidean spaces $X \in \mathbb{R}^D$ at all, and
our input space could be almost arbitrary. Using such a mapping and only
building large margin hyperplanes in Hilbert space permits the input $X$
to be almost any object, for example variable-length strings
[114, 112, 87]. For more generality we will therefore take
$\mathcal{X}$ to be the space the input data occupies and the mapping
to be $\Phi : \mathcal{X} \to \mathcal{H}$.
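The quadratic feature map mentioned above is small enough to write out
explicitly. The sketch below builds $\Phi(X) = [X; \mathrm{vec}(XX^T)]$
and compares the inner product of two mapped points with the closed form
$X^T Z + (X^T Z)^2$, which follows by expanding the sum over the
outer-product entries. The function names and test vectors are
illustrative assumptions.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def quadratic_feature_map(x):
    """Phi(x) = [x; vec(x x^T)]: x followed by all pairwise products x_i x_j."""
    outer = [xi * xj for xi in x for xj in x]
    return list(x) + outer

def quadratic_inner_product(x, z):
    """The same inner product computed without forming Phi: x^T z + (x^T z)^2."""
    s = dot(x, z)
    return s + s * s

x = [1.0, 2.0, -1.0]
z = [0.5, -1.0, 3.0]
explicit = dot(quadratic_feature_map(x), quadratic_feature_map(z))
implicit = quadratic_inner_product(x, z)
print(len(quadratic_feature_map(x)), explicit, implicit)
```

The explicit map has $D + D^2$ entries, whereas the closed form needs
only one $D$-dimensional dot product.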

Another important aspect of the support vector machine is that the
training algorithm and the decision rule can be expressed only in terms
of inner products between pairs of data points, such as object $t$ and
object $t'$, which are given by $\Phi(X_t)^T \Phi(X_{t'})$. It is often
the case that these inner products can be computed implicitly even when
it might be difficult or impossible to explicitly expand out
$\Phi(X_t)$ and $\Phi(X_{t'})$ in Hilbert space and subsequently
compute their scalar dot product. This inner product in $\mathcal{H}$
is, therefore, a generalization of the scalar dot product in the
original input space. It is typically written as a function of the two
objects in the input space and called a kernel:

$$k(X_t, X_{t'}) = \Phi(X_t)^T \Phi(X_{t'}) = \langle \Phi(X_t), \Phi(X_{t'}) \rangle.$$

The kernel function must satisfy Mercer's condition. Consider the
$T \times T$ Gram matrix $K$ which is used in the dual SVM optimization
problem. This matrix is built from arbitrary inputs $X_1, \ldots, X_T$
in $\mathcal{X}$ with its elements set to
$K_{t,t'} = k(X_t, X_{t'})$. Mercer's condition requires that $K$ is
always positive semidefinite for any choice of inputs. Another property
of kernels is that we can define their mapping from input space to
Hilbert space as $\Phi(X) = k(\cdot, X)$, where the dot is an argument
of the kernel function and $X$ indexes into a set of many kernel
functions. The inner product of a function
$f : \mathcal{X} \to \mathbb{R}$ in $\mathcal{H}$ with the mapping
$k(\cdot, X)$ is equal to the function evaluated at the point $X$, in
other words $f(X) = \langle k(\cdot, X), f \rangle$, so $k(\cdot, X)$
behaves like a delta function. We also see the familiar relationship
$\langle \Phi(X_t), \Phi(X_{t'}) \rangle = \langle k(\cdot, X_t), k(\cdot, X_{t'}) \rangle = k(X_t, X_{t'})$.
These relationships arise from the reproducing property of
$\mathcal{H}$, which we therefore refer to more specifically as a
reproducing kernel Hilbert space (RKHS).
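Mercer's condition can be probed numerically on a small Gram matrix.
The sketch below evaluates the quadratic form $v^T K v$ for a few probe
vectors; a negative value proves that the condition is violated, while
non-negative values are only a necessary spot check and not a proof of
positive semidefiniteness. The Gaussian kernel with scaling
$1/(2\sigma^2)$, the deliberately invalid kernel and the probe vectors
are all assumptions of this illustration.

```python
import math

def rbf_kernel(x, z, sigma=1.0):
    """exp(-||x - z||^2 / (2 sigma^2)); a standard Mercer kernel (scaling assumed)."""
    sq = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq / (2.0 * sigma ** 2))

def not_a_kernel(x, z):
    """Negative squared distance: symmetric, but violates Mercer's condition."""
    return -sum((xi - zi) ** 2 for xi, zi in zip(x, z))

def gram_matrix(X, kernel):
    """T x T matrix with entries K[t][s] = kernel(X[t], X[s])."""
    return [[kernel(xt, xs) for xs in X] for xt in X]

def quadratic_form(K, v):
    """v^T K v, which is >= 0 for every v when K is positive semidefinite."""
    n = len(K)
    return sum(v[i] * K[i][j] * v[j] for i in range(n) for j in range(n))

X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
probes = [[1.0, 0.0, 0.0], [1.0, -1.0, 0.0], [1.0, 1.0, 1.0], [1.0, -2.0, 1.0]]

K_good = gram_matrix(X, rbf_kernel)
K_bad = gram_matrix(X, not_a_kernel)
print([round(quadratic_form(K_good, v), 4) for v in probes])
print([round(quadratic_form(K_bad, v), 4) for v in probes])
```

For the invalid kernel, the all-ones probe sums every (non-positive)
matrix entry and therefore exposes the violation immediately.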

To handle nonlinear classification, we therefore replace all inner
products in the dual formulation of the SVM with $k(X_t, X_{t'})$. Note
that it is more convenient to solve the dual SVM problem than the primal
when using kernels instead of scalar dot products, since we avoid
dealing explicitly with $\Phi(X)$. Furthermore, we avoid using the
explicit representation of the decision boundary
$\mathrm{sign}(w^T \Phi(X) + b)$ and rather use its expansion only in
terms of the kernels. To compute the binary label $\hat{y}$ of a new
input data point $X$, we avoid directly referencing $\Phi(X)$ or
$\Phi(X_t)$ and instead use:

$$\hat{y} = \mathrm{sign}\left( \sum_t \lambda_t y_t k(X_t, X) + b \right).$$

Popular examples of kernels include polynomial kernels, which are
specified by a parameter $p$ indicating the order of the polynomial
decision boundary that will be used in the original space.
Figure 2.9(b) depicts an SVM using a second-order (quadratic) polynomial
kernel, $p = 2$, to separate the two binary classes in a nonlinear
classification problem. The thicker points in the figure represent the
support vectors, which have Lagrange multipliers $\lambda_t > 0$. We can
compute the polynomial kernel explicitly by expanding out the input
vector $X$ via $\Phi(X) = [1; X; X^2; \ldots; X^p]$ and taking dot
products in Hilbert space. However, this may be inefficient since it
requires exponentially large expansions for higher-dimensional vectors
$X$. Instead, this kernel can be computed implicitly by:

$$k(X_t, X_{t'}) = (X_t^T X_{t'} + 1)^p.$$

Other popular choices of kernel are the radial basis function (RBF)
kernels, which also have a tunable parameter $\sigma$ that determines
the smoothness of the SVM's decision boundary in the original space.
Figure 2.9(c) depicts an SVM decision boundary when using an RBF kernel.
As $\sigma$ grows, the boundary appears smoother. Note that the RBF
kernel, under appropriately small settings of $\sigma$, can handle an
almost arbitrary range of decision boundaries of almost arbitrary
complexity, much like a nearest neighbor classifier. One particularly
interesting property of the RBF kernel is that its corresponding
$\Phi(X)$ mapping and its space $\mathcal{H}$ are infinite dimensional,
and therefore the kernel can only be computed implicitly. The RBF kernel
is given by the following formula:

$$k(X_t, X_{t'}) = \exp\left( -\frac{1}{2\sigma^2} \|X_t - X_{t'}\|^2 \right).$$
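The polynomial and RBF kernels, together with the kernel expansion of
the decision rule, can be sketched in a few lines. The $1/(2\sigma^2)$
scaling of the RBF kernel is the common convention and is assumed here;
the support vectors, multipliers and bias are hypothetical values chosen
for illustration rather than the output of a dual solver.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def polynomial_kernel(x, z, p=2):
    """(x^T z + 1)^p."""
    return (dot(x, z) + 1.0) ** p

def rbf_kernel(x, z, sigma=1.0):
    """exp(-||x - z||^2 / (2 sigma^2)); the 1/(2 sigma^2) scaling is assumed."""
    sq = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq / (2.0 * sigma ** 2))

def kernel_decision(x, support_X, support_y, lams, b, kernel):
    """sign(sum_t lambda_t y_t k(X_t, x) + b), returned as +1 or -1."""
    score = sum(lam * yt * kernel(xt, x)
                for lam, yt, xt in zip(lams, support_y, support_X)) + b
    return 1 if score >= 0 else -1

# Hypothetical support vectors, multipliers and bias (not solver output).
support_X = [[1.0, 1.0], [-1.0, -1.0]]
support_y = [+1, -1]
lams, b = [0.5, 0.5], 0.0

for name, k in [("poly", polynomial_kernel), ("rbf", rbf_kernel)]:
    print(name,
          kernel_decision([2.0, 0.5], support_X, support_y, lams, b, k),
          kernel_decision([-0.5, -2.0], support_X, support_y, lams, b, k))
```

Only kernel evaluations against the support vectors are needed at
prediction time; neither $w$ nor $\Phi$ is ever formed.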

There are many elaborations on kernels and many other interesting
properties which are beyond the scope of this text; [165] is one text
that delves into further detail. Many efforts in the field are
investigating novel kernels for introducing the right types of
nonlinearities into the SVM problem. In fact, kernel methods are now
used to introduce interesting nonlinearities in many non-SVM machine
learning approaches including, for instance, principal components
analysis [166].

5. Averaged Classifiers 

An alternative to the SRM or SVM is to consider not just a single
solution that is fit to the data in conjunction with a helpful
regularizer or prior, but a weighted combination of many possible
classifier models. As in Bayesian inference, this may produce better
generalization properties. We will therefore also consider
discriminatively averaging classifiers. For instance, we may attempt to
average all linear classifiers that perfectly classify the training
data. This maintains the discriminative spirit since classification
boundaries that do not perfectly separate the labeled data are
eliminated in this averaging process. This specific learning problem can
actually be cast in the framework of Bayesian inference (or, more
specifically, a conditional Bayesian inference problem). One popular
approximation to the true Bayesian inference is called the Bayes point
machine (BPM) [71, 190, 124, 159]. For tractability reasons, the BPM is
not the true result of Bayesian inference but rather a single point
approximation. Instead of summing the effect of all linear classifier
models, the BPM uses a single model that is the closest to the mean over
the continuous space of valid models (i.e. the linear classifiers with
perfect classification accuracy on the training data).
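The idea of averaging only the classifiers that separate the training
data can be illustrated with a crude Monte Carlo sketch. This is not the
algorithm of the cited BPM references: it assumes bias-free classifiers
$\mathrm{sign}(w^T X)$, a uniform prior over directions of $w$, and
simple rejection sampling, and all names and data are illustrative.

```python
import math
import random

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def separates(w, X, y):
    """True if the bias-free classifier sign(w^T x) labels every point correctly."""
    return all(yt * dot(w, xt) > 0 for xt, yt in zip(X, y))

def averaged_classifier(X, y, n_samples=2000, seed=0):
    """Average the sampled unit-norm w vectors that perfectly classify the data."""
    rng = random.Random(seed)
    dim = len(X[0])
    total, kept = [0.0] * dim, 0
    for _ in range(n_samples):
        w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(dot(w, w))
        w = [wi / norm for wi in w]      # uniformly distributed direction
        if separates(w, X, y):           # hard, binary inclusion criterion
            total = [ti + wi for ti, wi in zip(total, w)]
            kept += 1
    if kept == 0:
        raise ValueError("no sampled classifier separates the data")
    return [ti / kept for ti in total], kept

X = [[1.0, 0.2], [0.5, 1.0], [-1.0, -0.3], [-0.4, -1.0]]
y = [+1, +1, -1, -1]
w_mean, kept = averaged_classifier(X, y)
print(w_mean, kept)
```

Because the set of separating $w$ vectors is a convex cone, the average
of the accepted samples is itself a separating classifier. On
non-separable data no sample is accepted, which mirrors the limitation
of the BPM on such data sets.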

Thus, in a Bayesian way, we would like to average over all linear
models. However, this averaging is not a soft probabilistic weighting
but is done according to a discriminative criterion which makes a binary
decision: only include classifiers that perfectly separate the training
data. This corresponds to the conditional distribution in Equation 2.5.
The averaging over models does bring forth slightly better
generalization properties for the Bayes point machine (BPM).
Unfortunately, in practice, its performance does not consistently exceed
that of SVMs. Furthermore, the BPM does not easily handle non-separable
data sets, where averaging only the models that perfectly classify the
data would yield no feasible solution whatsoever. A further practical
consideration is that the BPM is difficult to compute, requiring more
computational effort than generative modeling methods and SVMs (if the
latter are implemented efficiently, for instance using the method of
[147]).

Our main concern is that the BPM and its counterparts were really
designed to handle linear models or kernel-based nonlinearities.
Therefore, they are not easily computable for classifiers arising from
the large spectrum of generative models. For instance, exponential
family models and mixtures of the exponential family cannot be easily
estimated in a BPM framework. These methods do not enjoy the flexibility
of generative modeling (priors, non-separability, invariants, structured
models, latent variables, etc.), which limits their applicability.
Another discriminative averaging framework that addresses and overcomes
these limitations is maximum entropy discrimination (MED), which will be
introduced in the following chapter.

6. Joint Generative-Discriminative Learning 

After having explored the spectrum of discriminative and generative
modeling, we see a strong argument for a hybrid approach that combines
these deeply complementary schools of thought. Fusing the versatility
and flexibility of generative models with the power of a discriminative
framework that focuses resources on the given task would be extremely
valuable. Furthermore, as argued throughout, an averaging-based
approach (as opposed to a local or a regularized local fit to training
data) promises better generalization and a more principled Bayesian
treatment. Several approaches have recently been proposed for combining
generative and discriminative methods. These include Bayesian
estimation with special priors such as automatic relevance detection
[182]. However, these have only explored discriminative learning in the
context of simple linear or kernel-based models and have yet to show
applicability to the large spectrum of generative models.

An alternative technique involves modular combination of generative 
modeling with subsequent SVM classification using Fisher kernels [78]. 
This technique is readily applicable to a large spectrum of generative 
models which are first easily estimated with maximum likelihood and 
then used to form features and probabilistic kernels for training an SVM. 
However, the piece-meal cascaded approach of maximum likelihood fol- 
lowed by large margin estimation does not fully take advantage of the 
power of both techniques. For example, since the generative models are 
first estimated by maximum likelihood, this non-discriminative criterion 
might collapse important aspects of the model and sacrifice modeling 
power. For example, due to a model-mismatch, the pre-specified class 
of generative model may not have enough flexibility to capture all the 
information in the training data. At that point, the model’s resources 
may be misused and encode aspects of the data that are irrelevant for 
discrimination. This may result in poor performance on the required 
task. 

Essentially, discrimination needs to happen as early as possible in 
the model estimation process. Otherwise, in piece-meal approaches, one 
may incur a loss of valuable modeling power if maximum likelihood is 
used prior to working with a discriminative or SVM-related framework. 
For instance, a maximum likelihood HMM trained on speech data may 
focus all modeling power on the vowels (which are sustained longer than 
consonants) preventing a meaningful set of features for discrimination in 
the final SVM stage. Since there is no iteration between the generative 
modeling and the discriminative learning components of such efforts, the 
maximum likelihood estimate is not adjusted in response to the SVM’s 
criteria. Therefore, a simultaneous and up-front computation of the gen- 
erative model with a discriminative criterion would be an improvement 
over piece-meal techniques. 

In the next chapter, we will present the maximum entropy discrimination
formalism as a hybrid generative-discriminative model with many of the
desirable qualities we have so far motivated. The proposed MED framework
is a principled averaging technique which is able to span the large
spectrum of generative models and simultaneously perform estimation with
a discriminative criterion.




Chapter 3 

MAXIMUM ENTROPY DISCRIMINATION 



It is futile to do with more what can be done with fewer 1 . 

William of Ockham, 1280-1349 



Is it possible to combine the strongly complementary properties of 
discriminative estimation with generative modeling? Can, for instance, 
support vector machines and the performance gains they provide be 
combined elegantly with flexible Bayesian statistics and graphical mod- 
els? This chapter introduces a novel technique called maximum entropy 
discrimination (MED), which provides a general formalism for marrying 
both methods [80]. 

The duality between maximum entropy theory [82] and maximum like- 
lihood is well known in the literature [105]. Therefore, the connection be- 
tween generative estimation and classical maximum entropy already ex- 
ists. MED brings in a novel discriminative aspect to the theory and forms 
a bridge to contemporary discriminative methods. MED also involves an 
additional twist on the usual maximum entropy paradigm in that it con- 
siders distributions over model parameters instead of only distributions 
over data. Although other possible approaches for combining the gen- 
erative and discriminative schools of thought exist [78, 88, 182, 71], the 
MED formalism has distinct advantages. For instance, MED naturally 
spans both ends of the discriminative-generative spectrum: it subsumes 
support vector machines and extends their driving principles to a large 
majority of the generative models that populate the machine learning
community.

This chapter is organized as follows. We begin by motivating the
discriminative maximum entropy framework from the point of view of
regularization theory. Powerful convexity, duality and geometric
properties are elaborated. We then explicate how to solve classification
problems in the context of the maximum entropy formalism. The support
vector machine is then derived as a special case. Subsequently, we
extend the framework to discrimination with generative models and prove
that the whole exponential family of generative distributions is
immediately estimable within the MED framework. Generalization
guarantees are then presented.

Further MED extensions such as transductive inference, feature selec- 
tion, etc. are elaborated in the following chapter (Chapter 4). To make 
the MED framework applicable to the wide range of Bayesian models, 
latent models are also considered. These mixtures of the exponential 
family deserve special attention and their development in the context of 
MED is deferred to Chapter 5. 

1. Regularization Theory and Support Vector 
Machines 

We begin by developing the maximum entropy framework from a regu- 
larization theory and support vector machine perspective (this derivation 
was first described in [86]). For simplicity, we will only address binary 
classification in this chapter and defer other extensions to Chapter 4. 
Regularization theory is a field in its own right with many formalisms
(the approach we present is only one of many possible developments).
A good entry point to regularization theory for the machine learning
reader can be found in [62, 148].

We begin with a parametric family of decision boundaries,
$\mathcal{L}(X; \Theta)$, which are also called discriminant functions.
Each discriminant function (given a specific parameter $\Theta$) takes
an input $X$ and produces a scalar output. The sign ($\pm 1$) of the
scalar value indicates which class the input $X$ is assigned to. For
example, a simple type of decision boundary is the linear classifier.
The parameters of this classifier, $\Theta = \{\theta, b\}$, are the
concatenation of a linear parameter vector $\theta$ and the scalar bias
$b$. This generates the following linear classifier:

$$\mathcal{L}(X; \Theta) = \theta^T X + b. \qquad (3.1)$$

To estimate the optimal $\hat{\Theta}$, we are given a set of training
examples $\{X_1, \ldots, X_T\}$ and the corresponding binary ($\pm 1$)
labels $\{y_1, \ldots, y_T\}$. We would like to find a parameter setting
for $\Theta$ that minimizes some form of classification error. Once we
have found the best possible model, which we denote as $\hat{\Theta}$,
we can use our classifier to predict the labels of future input examples
via the decision rule

$$\hat{y} = \mathrm{sign}\, \mathcal{L}(X; \hat{\Theta}). \qquad (3.2)$$

We will form a measure of classification error based on loss functions
$L : \mathbb{R} \to \mathbb{R}$ that are applied to each data point.
These depend on our parameter $\Theta$ only through the classification
margin. The margin (see footnote 2) is defined as
$y_t \mathcal{L}(X_t; \Theta)$ and is positive whenever the label $y_t$
agrees with the scalar valued prediction $\mathcal{L}(X_t; \Theta)$ and
negative when they disagree.

We shall further assume that the loss function,
$L : \mathbb{R} \to \mathbb{R}$, is a non-increasing and convex function
of the margin. Thus, a larger margin results in a smaller loss. We also
introduce a regularization penalty $R(\Theta)$ on the models, which
favors certain parameters over others (like a prior).

The optimal parameter setting $\hat{\Theta}$ is computed by minimizing
the empirical loss and regularization penalty

$$\hat{\Theta} = \arg\min_{\Theta} \left\{ R(\Theta) + \sum_t L\left( y_t \mathcal{L}(X_t; \Theta) \right) \right\}.$$
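The regularized objective can be evaluated directly once a loss and a
penalty are fixed. In the sketch below the hinge loss and the quadratic
penalty are illustrative choices that satisfy the stated assumptions (a
non-increasing, convex loss of the margin); the function names and the
toy data are likewise assumptions of this example.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def discriminant(x, theta, b):
    """Linear discriminant of Equation 3.1: theta^T x + b."""
    return dot(theta, x) + b

def hinge_loss(margin):
    """A non-increasing, convex loss of the margin (illustrative choice)."""
    return max(0.0, 1.0 - margin)

def quadratic_penalty(theta):
    """R(Theta) = (1/2) theta^T theta (illustrative choice of regularizer)."""
    return 0.5 * dot(theta, theta)

def regularized_loss(theta, b, X, y, loss=hinge_loss, penalty=quadratic_penalty):
    """R(Theta) + sum_t L(y_t * discriminant(X_t; Theta))."""
    empirical = sum(loss(yt * discriminant(xt, theta, b)) for xt, yt in zip(X, y))
    return penalty(theta) + empirical

X = [[2.0, 0.0], [-0.5, 0.0]]
y = [+1, -1]
print(regularized_loss([1.0, 0.0], 0.0, X, y))   # margins are 2.0 and 0.5
print(regularized_loss([2.0, 0.0], 0.0, X, y))   # margins are 4.0 and 1.0
```

Doubling $\theta$ removes the remaining hinge loss but quadruples the
penalty, which is the trade-off the minimization balances.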



A more straightforward solution for $\hat{\Theta}$ is achieved by
recasting the above as a constrained optimization problem:

$$\begin{array}{ll} \min_{\Theta, \gamma_1, \ldots, \gamma_T} & R(\Theta) + \sum_t L(\gamma_t) \\ \text{subject to} & y_t \mathcal{L}(X_t; \Theta) - \gamma_t \ge 0, \ \forall t. \end{array} \qquad (3.3)$$

Here, we have also introduced the margin quantities $\gamma_t$, which
act as slack variables in the optimization, representing the minimum
margin that $y_t \mathcal{L}(X_t; \Theta)$ can admit. The minimization
is now over both the parameters $\Theta$ and the margins
$\gamma = \{\gamma_1, \ldots, \gamma_T\}$.

A (linear) support vector machine can be seen as a particular example
of the above formulation. There, the discriminant function is a linear
hyperplane as in Equation 3.1. Furthermore, the regularization penalty
is $R(\Theta) = \frac{1}{2}\theta^T \theta$, i.e., half the squared norm
of the parameter vector, which encourages large margin solutions. The
slack variables provide the SVM with a straightforward way to handle
non-separable classification problems. For the (primal) SVM optimization
problem, we have:

$$\begin{array}{ll} \min_{\Theta, \gamma} & \frac{1}{2}\theta^T \theta + \sum_t L(\gamma_t) \\ \text{subject to} & y_t (\theta^T X_t + b) - \gamma_t \ge 0, \ \forall t. \end{array}$$

At this point, we focus on the optimization over $\Theta$ alone and
ignore the optimization over the slack variables $\gamma$. The effect of
this restriction is that the resulting classifier (or support vector
machine) will require linearly separable data. In practice, we assume
the slack variables are held constant and set them manually to, e.g.,
unity: $\gamma_t = 1 \ \forall t$. This restrictive assumption is made
to simplify the following derivations and does not result in a loss of
generality. The restriction will be loosened subsequently, permitting us
to consider non-separable cases as well.

1.1 Solvability 

At this point, it is crucial to investigate under what conditions the 
constrained minimization problem in Equation 3.3 is solvable. For in- 
stance, can the above be cast as a convex program or can 0 be computed 
uniquely? 

A convex program typically involves the minimization of a convex cost
function under a convex hull of constraints. Under mild assumptions,
the solution is unique and a variety of strategies will converge to it
(e.g. axis-parallel optimization, linear-quadratic-convex programming,
etc.). In Figure 3.1, various constrained optimization scenarios are
presented. Figure 3.1(a) depicts a convex cost function with a convex
hull of constraints arising from the conjunction of multiple linear
constraints. This leads to a valid convex program.




(a) Convex Program 




Figure 3.1. Convex cost functions and convex constraints. 

In Figure 3.1(b) the situation is not as promising. Here, several
nonlinear constraints are combined and therefore the searchable space
forms a non-convex hull. This prevents guaranteed convergence and yields
a non-convex program. Similarly, in Figure 3.1(c), we do not have a
convex program. However, here the culprit is a non-convex cost function
(i.e. $R(\Theta)$ is not convex).

Therefore, for a solution to Equation 3.3, we must require that the
penalty function $R(\Theta)$ be convex, and that the conjunction of the
classification constraints for all $t$ forms a convex hull. The
intersection of linear constraints (under mild conditions) will always
form a convex hull. In addition, it should be evident that it is
unlikely that the intersection of multiple nonlinear constraints will
form a convex hull. Therefore, it is clear that the classification
constraints in the regularization framework need to be linear or at
least consistently mappable to a space where they become linear.
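The claim about linear constraints follows from a short argument,
sketched here for the linear discriminant of Equation 3.1 with the
margins $\gamma_t$ held fixed. The symbols $\theta_1, \theta_2, b_1,
b_2$ and $\alpha$ are introduced only for this illustration.

```latex
\begin{aligned}
&\text{Suppose } y_t(\theta_i^T X_t + b_i) - \gamma_t \ge 0
  \text{ for } i = 1, 2 \text{ and all } t, \text{ and let } \alpha \in [0,1]. \\
&\text{With } \theta = \alpha\theta_1 + (1-\alpha)\theta_2
  \text{ and } b = \alpha b_1 + (1-\alpha) b_2: \\
&y_t(\theta^T X_t + b) - \gamma_t
  = \alpha\left[ y_t(\theta_1^T X_t + b_1) - \gamma_t \right]
  + (1-\alpha)\left[ y_t(\theta_2^T X_t + b_2) - \gamma_t \right] \ge 0 .
\end{aligned}
```

Every point on the segment between two feasible parameter settings is
therefore feasible, so the feasible region is convex. The same argument
fails in general when the discriminant is nonlinear in $\Theta$.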



1.2 Support Vector Machines and Kernels 

Inspecting a support vector machine, we can immediately see that the
penalty function $R(\Theta) = \frac{1}{2}\theta^T \theta$ is convex and
that the linear hyperplane discriminant gives rise to linear constraints
and a convex hull. Thus, as is well known, the SVM is solvable via a
convex program (actually, a quadratic program [28]), for example using
iterative methods such as sequential minimal optimization [147].

But what do we do when $\mathcal{L}(X_t; \Theta)$ is nonlinear? For
example, we may wish to deal with decision boundaries that arise from
generative models. These can be computed via the log-likelihood ratio of
two generative models $P(X|\theta_+)$ and $P(X|\theta_-)$ (one for each
class). Here the parameter space includes the concatenation of the
positive generative model, the negative one and a scalar bias,
$\Theta = \{\theta_+, \theta_-, b\}$. This gives rise to the following
nonlinear discriminant functions:

$$\mathcal{L}(X; \Theta) = \log \frac{P(X|\theta_+)}{P(X|\theta_-)} + b. \qquad (3.4)$$

Unfortunately, these nonlinear decision boundaries generate a search
space for $\Theta$ that is no longer convex, compromising the uniqueness
and solvability of the problem.
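A log-likelihood ratio discriminant is easy to write down for a simple
generative model. The sketch below assumes univariate Gaussian class
models parameterized by (mean, variance); this model choice, the
function names and the parameter values are illustrative assumptions.

```python
import math

def gaussian_log_density(x, mean, var):
    """Log of the univariate Gaussian density N(x; mean, var)."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def log_ratio_discriminant(x, theta_pos, theta_neg, b):
    """log P(x | theta_+) - log P(x | theta_-) + b, as in Equation 3.4."""
    return (gaussian_log_density(x, *theta_pos)
            - gaussian_log_density(x, *theta_neg) + b)

def predict(x, theta_pos, theta_neg, b):
    return 1 if log_ratio_discriminant(x, theta_pos, theta_neg, b) >= 0 else -1

theta_pos = (2.0, 1.0)    # (mean, variance) of the positive class model
theta_neg = (-2.0, 1.0)   # (mean, variance) of the negative class model
print(log_ratio_discriminant(0.0, theta_pos, theta_neg, 0.0))
print(predict(1.5, theta_pos, theta_neg, 0.0),
      predict(-0.5, theta_pos, theta_neg, 0.0))

# With unequal variances the discriminant is quadratic in x, so one class
# can occupy a bounded interval rather than a half-line.
narrow_pos = (0.0, 1.0)
wide_neg = (0.0, 9.0)
print([predict(x, narrow_pos, wide_neg, 0.0) for x in (-4.0, 0.0, 4.0)])
```

Note that the discriminant is linear in $x$ only when the two variances
match, and it is never linear in the variance parameters themselves,
which is the source of the non-convexity discussed above.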

In some cases, nonlinear decision boundaries (i.e. nonlinear SVMs) can
be handled via the so-called kernel trick [165]. If a decision boundary
is nonlinear, one can consider a mapping of the data through some
function $\Phi(X)$ into a higher dimensional feature space. Therein, the
$\Theta$ parameter vector parameterizes a higher dimensional hyperplane,
effectively mimicking the nonlinearity in the original low dimensional
space. Furthermore, the constraints become linear and the search space
forms a convex hull.

One subtlety here, however, is that the regularization penalty is now
different in the feature space than in the original space. Therefore, if
we had a quadratic $R(\Theta)$ penalty function in the original space,
we would obtain some possibly complicated expression for it in the
feature space. This is reasonable in the case of SVMs since the
VC-dimension generalization guarantees hold at the level of the feature
space (or Hilbert space as in Chapter 2). This permits us to
artificially preserve a quadratic penalty function in the feature space
(which would map to a quite complicated one in the original space). The
kernel is useful since optimizing a quadratic penalty function in the
feature space only requires inner products between the high dimensional
vectors $\Phi(X_t)$, and these are implicitly computable using kernels
$k(X_t, X_{t'})$ without the explicit mapping.

However, more generally, we may have a specific regularization penalty
in mind at the level of the original space and/or nonlinearities in the
classifier that prevent us from considering the high-dimensional mapping
trick. This problematic situation is often the case for generative
models and motivates an important extension (MED) to the regularization
theory discussed so far.

2. A Distribution over Solutions 

We will now generalize this regularization formulation by presenting
the maximum entropy discrimination framework [80, 86]. First, we begin
by noting that it is not necessarily ideal to solve for a single optimal
setting of the parameter $\Theta$ when we could instead consider solving
for a full distribution over multiple $\Theta$ values (i.e. give a
distribution of solutions). The intuition is that many different
settings of $\Theta$ might generate relatively similar classification
performance, so it would be better to estimate a distribution
$P(\Theta)$ that preserves this flexibility instead of a single optimal
$\hat{\Theta}$. Clearly, with a full distribution $P(\Theta)$ we can
subsume the original formulation if we choose
$P(\Theta) = \delta(\Theta, \hat{\Theta})$, where the delta function can
be seen as a point-wise probability mass concentrated at
$\Theta = \hat{\Theta}$. This type of probabilistic solution is then a
superset of the direct optimization (see footnote 3). Here, we would
like our $P(\Theta)$ to be large at $\Theta$-values that yield good
classifiers and to be close to zero at $\Theta$-values that yield poor
classifiers. This probabilistic generalization will facilitate a number
of extensions to the basic regularization/SVM approach. We modify the
regularization approach as follows.

Given a distribution $P(\Theta)$, we can easily modify the
regularization approach for predicting a new label from a new input
sample $X$ that was shown in Equation 3.2. Instead of merely using one
discriminant function at the optimal parameter setting $\hat{\Theta}$,
we will integrate over all discriminant functions weighted by
$P(\Theta)$:

$$\hat{y} = \mathrm{sign} \int_{\Theta} P(\Theta)\, \mathcal{L}(X; \Theta)\, d\Theta. \qquad (3.5)$$
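The averaged decision rule of Equation 3.5 can be approximated by
replacing the integral with an average over samples of $\Theta$. The
sketch below assumes a linear discriminant and an isotropic Gaussian
$P(\Theta)$; both choices, along with the function names and numerical
values, are assumptions made only for illustration.

```python
import random

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def discriminant(x, theta, b):
    """Linear discriminant theta^T x + b."""
    return dot(theta, x) + b

def sample_parameters(mean_theta, mean_b, std, n, seed=0):
    """Draw n settings (theta, b) from an isotropic Gaussian P(Theta) (assumed)."""
    rng = random.Random(seed)
    return [([rng.gauss(m, std) for m in mean_theta], rng.gauss(mean_b, std))
            for _ in range(n)]

def averaged_prediction(x, samples):
    """Sign of the sample average of the discriminant (approximates Eq. 3.5)."""
    avg = sum(discriminant(x, theta, b) for theta, b in samples) / len(samples)
    return (1 if avg >= 0 else -1), avg

samples = sample_parameters(mean_theta=[1.0, -1.0], mean_b=0.5, std=0.1, n=2000)
label, avg = averaged_prediction([3.0, 1.0], samples)
print(label, avg)
```

For a discriminant that is linear in $\Theta$, the average equals the
discriminant evaluated at the mean of $P(\Theta)$, so the sample average
here should be close to $3 - 1 + 0.5 = 2.5$. For discriminants that are
nonlinear in $\Theta$, the averaged rule and the rule at the mean
parameter generally differ.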

How do we estimate $P(\Theta)$? Again, we consider an expectation form
of the previous approach and cast Equation 3.3 as an integral. The
classification constraints will also be applied in an expected sense. It
is inappropriate to directly apply the arbitrary penalty function
$R(\Theta)$ to infinite dimensional probability density functions such
as $P(\Theta)$. Instead of considering an expectation of penalty
functions, we will apply a canonical penalty function for distributions,
the negative entropy. Minimizing the negative entropy is equivalent to
maximizing the entropy. Maximum entropy theory was pioneered by Jaynes
and others [113] to compute distributions with moment constraints. In
the absence of any further information, Jaynes argues that one should
satisfy the constraints in a way that is least committal or prejudiced.
This gives rise to the need for a maximum entropy distribution, one that
is as close to uniform as possible. Here, we assume the Shannon entropy,
defined as $H(P(\Theta)) = -\int P(\Theta) \log P(\Theta)\, d\Theta$.
Traditionally in the maximum entropy community, distributions are
computed subject to moment constraints (i.e. not discrimination or
classification constraints). Here, the term discrimination is added to
specify that our framework borrows from the concepts of
regularization/SVM theory and satisfies discriminative classification
constraints (based on margin).

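This gives us the following novel MED formulation (see footnote 4) for
finding a distribution $P(\Theta)$ over the parameters $\Theta$:

$$\begin{array}{ll} \min_{P(\Theta)} & -H(P(\Theta)) \\ \text{subject to} & \int P(\Theta) \left[ y_t \mathcal{L}(X_t; \Theta) - \gamma_t \right] d\Theta \ge 0 \ \ \forall t. \end{array}$$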

At present, negative entropy is not very flexible as a surrogate penalty
function since it cannot accommodate prior information we may have about
the desired $P(\Theta)$ settings. To generalize, we cast negative entropy
as a Kullback-Leibler divergence from $P(\Theta)$ to a target uniform
distribution as follows: $-H(P(\Theta)) = KL(P(\Theta) \| P_{\mathrm{uniform}}(\Theta))$,
which holds up to an additive constant that does not affect the
minimization. Note the standard definition of the Kullback-Leibler
divergence:

$$KL(P(\Theta) \| Q(\Theta)) = \int P(\Theta) \log \frac{P(\Theta)}{Q(\Theta)}\, d\Theta$$

which is sometimes written as $KL(P \| Q)$ or $D(P \| Q)$. If we have prior
knowledge about the desired shape of $P(\Theta)$, we may not necessarily
want to favor high entropy uniform solutions. Instead, we can customize
the target distribution and use a non-uniform one by replacing our
penalty function with the Kullback-Leibler divergence to any prior,
$KL(P(\Theta) \| P^0(\Theta))$. This gives us the more general minimum relative
entropy discrimination (MRE or MRED) formulation which we define
as follows:

Definition 3.1 The minimum relative entropy discrimination approach
finds the distribution $P(\Theta)$ over the parameters $\Theta$ that minimizes
the divergence $KL(P_\Theta \| P^0_\Theta)$ subject to
$\int P(\Theta)\, [\, y_t \mathcal{L}(X_t;\Theta) - \gamma_t \,]\, d\Theta \ge 0 \;\; \forall t$.
Here $P^0_\Theta$ is the prior distribution over the parameters. The resulting
decision rule is given by $\hat{y} = \mathrm{sign}\left( \int P(\Theta)\, \mathcal{L}(X;\Theta)\, d\Theta \right)$.

It is traditional to continue to refer to minimum relative entropy ap- 
proaches as maximum entropy. Therefore, at the risk of confusion, we 
shall adopt this convention in the nomenclature and refer to Defini- 
tion 3.1 as maximum entropy discrimination. At this point, we evaluate 
the solvability of our formulation. 
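The link between negative entropy and the divergence to a uniform target can
be checked in the discrete case. The sketch below is only an illustration
with made-up probabilities; the helper names are hypothetical. For a uniform
target over $n$ states, the divergence equals the negative entropy plus the
constant $\log n$.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (natural log)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

p = np.array([0.5, 0.25, 0.125, 0.125])    # illustrative distribution
uniform = np.full(4, 0.25)

# Difference between KL to uniform and negative entropy: a constant, log(4).
gap = kl_divergence(p, uniform) - (-entropy(p))
```

Because the gap does not depend on `p`, minimizing either quantity selects
the same distribution.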

Figure 3.2 depicts the problem formulation. We note that now we are
dealing with a possibly infinite-dimensional space, since instead of
solving for a parameter vector $\Theta$, we are solving for $P(\Theta)$, a
probability distribution. In the figure, the axes represent the variation
of two coordinates of the possibly continuous distribution, $P(\Theta)$.
Instead of $\mathcal{R}(\Theta)$, a penalty function, we have the KL-divergence,
which is a convex function of $P(\Theta)$. Furthermore, the constraints are
expectations with respect to $P(\Theta)$, which means they are linear in
$P(\Theta)$. These linear constraints are guaranteed to combine into a convex
hull for the search space of $P(\Theta)$ regardless of the nonlinearities in
the discriminant function!



Figure 3.2. MED Convex Program Problem Formulation.

Therefore, the solution to Definition 3.1 is given by a valid convex
program. In fact, the solution to the MED classification problem in
Definition 3.1 is directly solvable using a classical result from maximum
entropy:

Theorem 3.1 The solution to the MED problem for estimating a distribution
over parameters has the following form (cf. Cover and Thomas [35]):

$$P(\Theta) = \frac{1}{Z(\lambda)}\, P^0(\Theta)\, e^{\sum_t \lambda_t [\, y_t \mathcal{L}(X_t;\Theta) - \gamma_t \,]}$$

where $Z(\lambda)$ is the normalization constant (partition function) and
$\lambda = \{\lambda_1, \ldots, \lambda_T\}$ are a set of non-negative Lagrange multipliers,
one per classification constraint. The Lagrange multipliers are set by
finding the unique maximum of the jointly concave objective function

$$J(\lambda) = -\log Z(\lambda). \qquad (3.6)$$

This solution is the dual problem to the constrained optimization in 
Definition 3.1 (the primal problem) via the Legendre transform. Under 
mild conditions, a solution always exists, and when it does it is unique. 
Occasionally, the objective function $J(\lambda)$ may grow without bound and
prevent the existence of a unique solution; however, this situation is rare
in practice. Furthermore, it is typically far easier to solve the dual
problem since the complexity of the constraints is alleviated. The
constraints on the Lagrange multipliers (i.e. non-negativity) are more
straightforward to enforce than constraints on the possibly
infinite-dimensional distribution $P(\Theta)$ in the primal problem. The
non-negativity of the Lagrange multipliers arises in maximum entropy
problems when inequality constraints are present in the primal problem (such
as those representing our classification constraints in Definition 3.1)$^5$.
At this point, we shall loosen the constraint that the margins are fixed 
and allow classification scenarios which are non-separable. 
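Before moving on, Theorem 3.1 can be illustrated on a toy problem with fixed
margins. The sketch below is not the monograph's implementation: it assumes a
finite grid of candidate parameters (unit vectors in the plane), a uniform
prior over that grid, a discriminant $\theta^T X$ without bias, and plain
projected gradient ascent on $J(\lambda)$. All data values and names are
hypothetical.

```python
import numpy as np

# Toy data: two separable classes in 2-D (illustrative values).
X = np.array([[2.0, 1.0], [1.0, 2.0], [2.0, 2.0],
              [-2.0, -1.0], [-1.0, -2.0], [-2.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])
gamma = 0.5                                    # fixed margin for every example

# Finite grid of candidate parameters: unit vectors theta_k (an assumption).
angles = np.linspace(0.0, 2.0 * np.pi, 72, endpoint=False)
thetas = np.stack([np.cos(angles), np.sin(angles)], axis=1)
prior = np.full(len(thetas), 1.0 / len(thetas))   # uniform P0(Theta)

# F[t, k] = y_t * L(X_t; theta_k) - gamma, the constraint feature.
F = y[:, None] * (X @ thetas.T) - gamma

def posterior(lam):
    """P(Theta) proportional to P0(Theta) * exp(sum_t lam_t * F[t, Theta])."""
    log_w = np.log(prior) + lam @ F
    w = np.exp(log_w - log_w.max())            # subtract max for stability
    return w / w.sum()

def dual_objective(lam):
    """J(lam) = -log Z(lam), computed with a log-sum-exp."""
    a = np.log(prior) + lam @ F
    m = a.max()
    return -(m + np.log(np.exp(a - m).sum()))

lam = np.zeros(len(y))
for _ in range(5000):
    grad = -(F @ posterior(lam))               # dJ/dlam_t = -E_P[F_t]
    lam = np.maximum(0.0, lam + 0.01 * grad)   # ascent step, then project onto lam >= 0

P = posterior(lam)
expected_slack = F @ P                         # expected constraint values under P
```

At convergence the expected constraint values are non-negative, and a
multiplier is positive only where its constraint is tight, mirroring the
role of support vectors.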



3. Augmented Distributions 

The MED formulation so far has made the assumption that the margin
values $\gamma_t$ are pre-specified and held fixed. Therefore, the discriminant
function must be able to perfectly separate the training examples with 
some pre-specified margin. This may not always be possible in prac- 
tice, for instance if the data set is non-separable, and will generate an 
empty convex hull for the solution space. Thus, we need to revisit the 
setting of the margin values and the loss function upon them. First, 
recall that we have so far ignored the loss function in the regularization 
framework as we derived the MED technique since we held the margins 
fixed. However, the choice of the loss function (penalties for violating 
the margin constraints) also admits a more principled solution in the 
MED framework. 

As we had earlier for the case of the parameters, let us also now
consider a distribution over margins, denoted as $P(\gamma)$, in the MED
framework [80]. Typically, for good classification performance
(VC-dimension generalization guarantees encourage large margin solutions),
we will choose margin distributions that favor larger margins. Furthermore,
by varying our choice of distribution we can effectively mimic various loss
functions associated with $\gamma$. Also, by choosing priors that allow a
non-zero probability mass for negative margins, we can permit non-separable
classification (without ad-hoc slack variables as in SVMs). This will
ensure that the classification constraints will never give rise to an empty
admissible set. The MED formulation will then give a solution over the
joint distribution, $P(\Theta, \gamma)$. This gives a weighted continuum of
solutions instead of specifying a single optimal value for each parameter,
as in the regularization approach. There is a caveat, however, since the
MED constraints apply only through expectations over the margin values. We
are now satisfying a looser problem than when the margin values were
set, and thus, this transition from margin values to margin distributions
is less natural than the previous transition from parameter extrema to
parameter distributions. Since there are multiple margin values (one for
each training example $t$), $P(\gamma)$ is an aggregate distribution over all
margins and will typically be factorized as
$P(\Theta, \gamma) = P(\Theta) \prod_t P(\gamma_t)$. This leads to the following more
general MED formulation:

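Definition 3.2 The MED distribution $P(\Theta, \gamma)$ over the parameters
$\Theta$ and the margin variables $\gamma = [\gamma_1, \ldots, \gamma_T]$ minimizes
$KL(P_\Theta \| P^0_\Theta) + \sum_t KL(P_{\gamma_t} \| P^0_{\gamma_t})$ subject to
$\int P(\Theta, \gamma)\, [\, y_t \mathcal{L}(X_t;\Theta) - \gamma_t \,]\, d\Theta\, d\gamma \ge 0 \;\; \forall t$.
Here $P^0_\Theta$ is the prior distribution over the parameters and $P^0_\gamma$ is
the prior over margin variables. The resulting decision rule is given by
$\hat{y} = \mathrm{sign}\left( \int P(\Theta)\, \mathcal{L}(X;\Theta)\, d\Theta \right)$.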

Once again, a solution exists under mild assumptions and is unique.
Here, though, the constraints are not just expectation constraints over
the parameter distribution, but also over an expectation on the margin
distribution. This relaxes the convex hull since the constraints do
not need to hold for a specific margin. The constraints need only hold
over a distribution over margins that can include negative margins, thus
permitting us to consider non-separable classification problems.
Furthermore, in applying MED to a problem, we no longer specify ad-hoc
regularization penalty functions via $\mathcal{R}(\Theta)$ or margin penalty
functions such as our $L(\gamma_t)$ loss functions. Instead, we specify
probability distributions and priors. These distributions can sometimes be
more convenient to specify and then automatically give rise to penalty
functions for the model and the margins via KL-divergences. More
specifically, the model distribution will give rise to the divergence term
$KL(P_\Theta \| P^0_\Theta)$, and the margin distribution will give rise to a
divergence term $KL(P_{\gamma_t} \| P^0_{\gamma_t})$, which correspond to the
regularization penalty and the loss functions respectively. Since both
terms are based on probability distributions and KL-divergence, the
trade-off between classification loss and regularization is now on a common
probabilistic scale.

The solution to the non-separable MED classification problem in Def- 
inition 3.2 is given by the following theorem: 

Theorem 3.2 The solution to the MED problem for estimating a distribution
over parameters and margins (as well as further augmentations)
has the following general form (cf. Cover and Thomas 1996):

$$P(\Theta, \gamma) = \frac{1}{Z(\lambda)}\, P^0(\Theta, \gamma)\, e^{\sum_t \lambda_t [\, y_t \mathcal{L}(X_t;\Theta) - \gamma_t \,]}$$

where $Z(\lambda)$ is the normalization constant (partition function) and
$\lambda = \{\lambda_1, \ldots, \lambda_T\}$ are a set of non-negative Lagrange multipliers,
one per classification constraint. The multipliers, $\lambda$, are set by
finding the unique maximum of the jointly concave objective function

$$J(\lambda) = -\log Z(\lambda). \qquad (3.7)$$

Further details for the choices of the priors for the parameters and
margins as well as other distributions will be elaborated in the following
sections. It is always possible to recast the optimization problem that
the maximum entropy formulation has generated back into the regularization
form, in terms of loss functions and regularization penalties [86].
However, MED's probabilistic formulation is intuitive and provides
more flexibility. For instance, we can continue to augment our solution
space with distributions over other entities and maintain the convex
cost function with convex constraints. For example, one could include
a distribution over unobserved labels $y_t$ or unobserved inputs $X_t$ in the
training set. Or, we could introduce further continuous or discrete
variables into the discriminant function that are unknown and integrate over
them. The distribution $P(\Theta)$ could effectively become
$P(\Theta, \gamma, y, X, \ldots)$ and, in principle, we will still maintain the
convex program structure and dual solution portrayed in Theorem 3.2. These
types of extensions will be elaborated further in Chapter 4. One important
caveat remains, however, when we augment distributions: we should maintain
a balance between the various priors we are trying to minimize
KL-divergence to. If a prior over models $P^0(\Theta)$ is too strict, it may
overwhelm a prior over other quantities such as margins, $P^0(\gamma)$, and
vice-versa. Therefore, the minimization of KL-divergence will be skewed
more towards one prior than the other.



4. Information and Geometry Interpretations

There is an interesting geometric interpretation of the MED solution
which can be described as a type of information projection. This
projection is depicted in Figure 3.3 and is often referred to as a relative
entropy projection or e-projection as in [2]. The multiple linear
constraints form a convex hull that generates an admissible set,
$\mathcal{P}$. This convex hull is also referred to as an m-flat constraint
set [2]. The MED solution is the point in the admissible set that is
closest in terms of divergence from the prior distribution $P^0(\Theta)$. This
analogy extends to cases where the distributions are also over margins,
unlabeled examples, missing values, structures, or other probabilistic
entities that are introduced when designing the discriminant function.




Figure 3.3. MED as an Information Projection Operation. 

The MED probabilistic formalism also has interesting conceptual
connections to other recent information theoretic and boosting approaches.
One point of contact is with the entropy projection and boosting
(AdaBoost) framework developed in [162] and [102]. Boosting uses a
distribution that weights each data point in a training set and forms a weak
learner based upon it. This process is iterated, updating the distribution
over data and the weak learner for $t = 1 \ldots T$ iterations. All hypotheses
are then combined in a weighted mixture of weak learners called the final
master algorithm. Effectively, each boosting step estimates a new
distribution $P^{t+1}$ over the training data that both minimizes the
relative entropy to a prior distribution $P^t$ and is orthogonal to a
performance vector denoted $U^t$. The performance vector $U^t$ is of the same
cardinality as $P^t$ and has values in the range $[-1, 1]$. If the previous
weak learner given by the prior probability distribution correctly
classifies a data point, then the $U^t$ vector at that training datum's index
has a value close to 1. If the datum is poorly classified, then $U^t_i$ is
$-1$ at the corresponding index. Therefore, we update the distribution using
the following exponential update rule (which follows directly from classical
maximum entropy results):

$$P^{t+1}_i \propto P^t_i \exp(-\alpha U^t_i).$$
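A single corrective step of this kind can be sketched as follows. The text
does not say how $\alpha$ is chosen; the code below assumes it is set so that
the new distribution is orthogonal to the performance vector, and finds that
value by bisection. The function name and the example vectors are
hypothetical.

```python
import numpy as np

def corrective_update(p, u):
    """Return (p_new, alpha) with p_new proportional to p * exp(-alpha * u)
    and p_new orthogonal to u (assumes u has both positive and negative entries)."""
    def correlation(alpha):
        w = p * np.exp(-alpha * u)
        return (w @ u) / w.sum()

    lo, hi = -50.0, 50.0                  # assumed bracket for alpha
    for _ in range(200):                  # correlation decreases monotonically in alpha
        mid = 0.5 * (lo + hi)
        if correlation(mid) > 0:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)
    w = p * np.exp(-alpha * u)
    return w / w.sum(), alpha

p_t = np.full(5, 0.2)                           # current distribution over examples
u_t = np.array([1.0, 1.0, 1.0, -1.0, 1.0])      # one misclassified example (index 3)
p_next, alpha = corrective_update(p_t, u_t)
```

In this example the misclassified point ends up carrying half of the
probability mass, which is what makes the new distribution uncorrelated with
the previous weak learner's performance.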

Instead of considering an iterative approach where individual corrective
updates are made, we may enforce all the orthogonality constraints we
have up until now and generate a full convex hull to constrain the entropy
projection [102]:

$$P^{t+1}_i \propto P^1_i \exp\left( -\sum_{q=1}^{t} \alpha_{t,q} U^q_i \right).$$

Kivinen and Warmuth [102] argue that each new distribution should
provide information not present in the current weak hypothesis given
the individual orthogonality constraint. When we simultaneously consider
all orthogonality constraints up until time $t$, the new hypothesis
should provide new information that is uncorrelated with all previous
hypotheses. The convex hull of constraints results in the exponentiated
$\sum_{q=1}^{t} \alpha_{t,q} U^q_i$ terms in the above equation, which are strongly
reminiscent of the MED formulation's exponentiated classification
constraints (and their Lagrange multipliers). We can therefore interpret
the MED formulation as minimizing a divergence to a prior while extracting
as much information as possible from the training data.

Another information-theoretic point of contact can be found in the
work of Tishby and others [183, 171]. Here, the authors propose
minimizing the lossy coding of input data $X$ via a compact representation
$\tilde{X}$ while maintaining a constraint on the mutual information,
$I(\tilde{X};Y)$, between the coding and some desired output variable, $Y$.
This information-theoretic setting gives rise to the constrained
optimization of $I(X;\tilde{X}) - \beta I(\tilde{X};Y)$. The result is an efficient
representation of the input data $X$, which extracts as much information as
possible (in terms of bits to encode) about the output variable. A loose
analogy can be made to the MED framework, which solves for a distribution
$P(\Theta)$ that minimally encodes the prior distribution $P^0(\Theta)$
(analogous to $\tilde{X}$ and $X$, respectively) such that the classification
constraints due to the training data (analogous to the relevance variables)
are satisfied and provide as much information as possible.

An important connection also lies between MED and Kullback's early
work on the so-called minimum discrimination information method [105].
The definition Kullback adopts for discrimination is slightly different
from the one we are discussing here. It mainly involves discrimination
between two competing hypotheses based on an information metric
where one hypothesis has to satisfy some additional constraints while be- 
ing as close to the prior hypothesis as possible. The mechanism proposed 
by Kullback is therefore very similar to the maximum entropy formalism 
that Jaynes proposes [82] and he even describes connections to Shannon’s 
theory of communication [167]. Kullback elaborates both these connec- 
tions, but, ultimately, the minimum discrimination information method 
is akin to traditional maximum entropy theory. The information between
hypotheses involves distributions over the input/output variables
as opposed to distributions over parameters as in MED. Furthermore, 
the constraints are not margin-based (or even classification-based) as 
in MED, and thus, do not give rise to a discriminative classifier (or 
regressor). Nevertheless, MED seems to be a natural continuation of 
Kullback’s approach and can be seen as a contemporary effort to com- 
bine it with the current impetus towards discriminative estimation as in 
the SVM literature and related generalization guarantees. 

5. Computing the Partition Function 

Ultimately, implementing the MED solution given by Theorem 3.2
hinges on our ability to perform the required calculations. For instance,
we need to maximize the concave objective function to obtain the optimal
setting of the Lagrange multipliers $\lambda = \{\lambda_1, \ldots, \lambda_T\}$:

$$J(\lambda) = -\log Z(\lambda).$$

Ideally, we would like to be able to evaluate the partition function
$Z(\lambda)$ analytically or at least efficiently. More precisely, the
partition function is given by:

$$Z(\lambda) = \int P^0(\Theta, \gamma)\, e^{\sum_t \lambda_t [\, y_t \mathcal{L}(X_t;\Theta) - \gamma_t \,]}\, d\Theta\, d\gamma. \qquad (3.8)$$

A closed form partition function leads to a convenient concave objective 
function that can then be optimized by standard techniques. Possible 
choices include convex programming, first and second order methods 
and axis-parallel methods. Implementation details as well as some novel 
speed improvements (such as learning which Lagrange multipliers are 
critical to maximization) for optimizing $J(\lambda)$ are provided in the
Appendix.

An additional use of the partition function comes from a powerful
aspect of maximum entropy (and exponential family distributions) inherited
by MED. Gradients (of arbitrary order) of the log-partition $\log Z(\lambda)$
with respect to any given variable $\lambda_t$ are equal to the expectations (of
arbitrary order) of the corresponding moment constraints with respect
to the maximum entropy distribution. This permits us to easily compute
the expectations and variances over $P(\Theta, \gamma)$ of the MED constraints by
taking first and second derivatives of $\log Z$. Given a closed-form MED
partition function, we can conveniently obtain these expectations as
follows:

$$\frac{\partial \log Z(\lambda)}{\partial \lambda_t} = E_{P(\Theta,\gamma)} \left\{ y_t \mathcal{L}(X_t;\Theta) - \gamma_t \right\}$$

$$\frac{\partial^2 \log Z(\lambda)}{\partial \lambda_t^2} = \mathrm{Var}_{P(\Theta,\gamma)} \left\{ y_t \mathcal{L}(X_t;\Theta) - \gamma_t \right\}.$$

The log-partition function $\log Z(\lambda)$ is also called the cumulant
generating function since its higher order derivatives generate higher
order cumulants of the linear constraint features or sufficient statistics
(the mean
and the variance are the first and second order cumulants, respectively). 
This forms a deep connection between maximum entropy methods and 
the exponential family [11, 105]. In fact, all maximum entropy distribu- 
tions under linear constraints are exponential family distributions whose 
log-partition function over the Lagrange multipliers is also the cumulant 
generating function over the natural parameters. 
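The derivative identities above can be checked numerically on a small
discrete example. This is only a sanity check under assumed values: a
four-state prior and a single constraint feature, both hypothetical, with
finite differences standing in for the analytic derivatives.

```python
import numpy as np

prior = np.array([0.1, 0.2, 0.3, 0.4])        # P0 over four hypothetical states
feature = np.array([-1.0, 0.5, 1.0, 2.0])     # constraint feature value per state

def log_partition(lam):
    """log Z(lam) = log sum_k P0_k * exp(lam * f_k)."""
    return np.log(np.sum(prior * np.exp(lam * feature)))

lam, h = 0.7, 1e-4
post = prior * np.exp(lam * feature)
post /= post.sum()                             # tilted (maximum entropy) distribution

mean = post @ feature                          # expectation of the feature
var = post @ (feature - mean) ** 2             # variance of the feature

# Central finite differences of log Z with respect to lam.
d1 = (log_partition(lam + h) - log_partition(lam - h)) / (2 * h)
d2 = (log_partition(lam + h) - 2 * log_partition(lam) + log_partition(lam - h)) / h**2
```

The first difference matches the mean and the second matches the variance,
up to finite-difference error.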

Unfortunately, the integrals required to compute this critical
log-partition function may not always be analytically solvable. When they
are, various strategies can be used to optimize $J(\lambda)$. For instance,
axis-parallel techniques will iteratively converge to the global maximum. In
certain situations, $J(\lambda)$ may also be maximized using quadratic
programming. Furthermore, online evaluation of the decision rule after training
from data also requires an integral followed by a sign operation which 
may not be feasible for arbitrary choices of the priors and discriminant 
functions. However, this is usually less cumbersome than actually com- 
puting the partition function to obtain the optimal Lagrange multipliers. 

In the following sections we shall specify under what conditions the
computations will remain tractable. These will depend on the specific
configuration of the discriminant function $\mathcal{L}(X;\Theta)$ as well as the
choice of the prior $P^0(\Theta, \gamma)$. We discuss, in turn, various choices
of margin priors, bias priors, model priors and discriminant functions.



6. Margin Priors 

We now turn our attention to the margins whose prior distribution
we will choose to encourage a large margin solution in the same spirit as
support vector machines and other discriminative learning approaches.
We can expand the partition function in Equation 3.8 by noting that
the distribution factorizes:

$$\begin{aligned}
Z(\lambda) &= \int P^0(\Theta, \gamma)\, e^{\sum_t \lambda_t [\, y_t \mathcal{L}(X_t;\Theta) - \gamma_t \,]}\, d\Theta\, d\gamma \\
&= \int P^0(\Theta)\, e^{\sum_t \lambda_t y_t \mathcal{L}(X_t;\Theta)}\, d\Theta \times \prod_t \int P^0(\gamma_t)\, e^{-\lambda_t \gamma_t}\, d\gamma_t \\
&= Z_\Theta(\lambda) \times \prod_t Z_{\gamma_t}(\lambda_t).
\end{aligned}$$

Recall that our objective function is the negated logarithm of the
partition function, which we expand as:

$$J(\lambda) = -\log Z_\Theta(\lambda) - \sum_t \log Z_{\gamma_t}(\lambda_t) = J_\Theta(\lambda) + \sum_t J_{\gamma_t}(\lambda_t).$$

These $J_{\gamma_t}(\lambda_t)$ behave very similarly to the loss functions
$L(\gamma_t)$ in the original regularization theory approach (actually, they
are negated versions of the loss functions). We now have a direct way of
finding penalty terms $-J_{\gamma_t}(\lambda_t)$ from margin priors $P^0(\gamma_t)$ and
vice-versa. Thus, there is a dual relationship between defining an
objective function and penalty terms and defining a prior distribution over
parameters and prior distribution over margins.

For instance, consider the following margin prior distribution:

$$P^0(\gamma_t) = c\, e^{-c(1-\gamma_t)}, \quad \gamma_t \le 1. \qquad (3.9)$$

Integrating, we get the penalty function (Figure 3.4):

$$\log Z_{\gamma_t}(\lambda_t) = \log \int_{\gamma_t=-\infty}^{1} c\, e^{-c(1-\gamma_t)}\, e^{-\lambda_t \gamma_t}\, d\gamma_t = -\lambda_t - \log(1 - \lambda_t/c).$$

In this case, a penalty is incurred for margins smaller than the prior
mean of $\gamma_t$, which is $1 - 1/c$. Margins larger than this quantity are not
penalized and the associated classification constraint becomes irrelevant
(i.e. the corresponding Lagrange multiplier could possibly vanish).
Increasing the parameter $c$ will encourage separable solutions and when
$c \to \infty$, the margin distribution becomes peaked at $\gamma_t = 1$, which is
equivalent to having fixed margins as in our initial MED Definition 3.1.
The choice of the margin distribution will correspond closely to the use 
of slack variables in the SVM formulation and the choice of different 
loss functions in the regularization theory approach. In fact, the pa- 
rameter c plays an almost identical role to the regularization parameter 
which upper bounds the Lagrange multipliers in the slack variable SVM 
solution. 
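The closed form for the margin prior of Equation 3.9 can be verified by
direct numerical integration. The sketch below uses a plain trapezoid rule on
a truncated grid; the grid limits, the values of `c` and `lam`, and the
helper name are arbitrary choices made for the check.

```python
import numpy as np

c, lam = 3.0, 1.2                              # any 0 <= lam < c works here

# Integrate over gamma in (-inf, 1]; the lower tail is truncated at -40,
# where the integrand is negligible.
gamma = np.linspace(-40.0, 1.0, 400001)
prior = c * np.exp(-c * (1.0 - gamma))         # margin prior of Equation 3.9
integrand = prior * np.exp(-lam * gamma)

def trapezoid(values, grid):
    """Composite trapezoid rule on a (possibly non-uniform) grid."""
    return np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(grid))

log_z_numeric = np.log(trapezoid(integrand, gamma))
log_z_closed = -lam - np.log(1.0 - lam / c)    # closed form from the text
prior_mean = trapezoid(gamma * prior, gamma)   # compare with 1 - 1/c
```

The closed form diverges as `lam` approaches `c`, which is the barrier
behaviour that keeps each multiplier below `c`.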



Figure 3.4(a) shows the prior in Equation 3.9 and its associated
potential term (the negated penalty term). Various other margin priors and
penalty terms that are analytically computable$^6$ are given in Table 3.1
and Figure 3.4.

Figure 3.4. Margin prior distributions (top) and potential functions (bottom). The
dotted line indicates the potential function that arises when the margins are fixed at
unity (which assumes separability). For all plots, the value $c = 3$ was used.

Table 3.1. Margin prior distributions and potential functions (columns: margin
prior $P^0(\gamma_t)$, dual potential term $J_{\gamma_t}(\lambda_t)$).



Note that all the priors in Table 3.1 form concave potential functions
(or convex penalty functions), as desired for a unique optimum in the
space of Lagrange multipliers. It should be noted that some potential
functions will force an upper bound (via a barrier function) on the
$\{\lambda_t\}$ while others will allow them to vary freely (as long as they
are non-negative). Other priors and penalty functions are also possible,
in particular, for the regression case which will be discussed later and
which will require quite different margin configurations. We now move
to priors for the model, in particular, priors for the bias.



7. Bias Priors

Bias is merely a subcomponent of the model, but due to its interaction
with the discriminant function, it will be treated separately here. More
specifically, the bias, $b$, appears as an additive scalar in the
discriminant. Recall that $\Theta$ can be seen as a concatenation of all
parameters and thus we can consider the breakdown: $\Theta = \{\theta, b\}$. Recall
the following form of the discriminant functions from Equation 3.1 (or
Equation 3.4):

$$\mathcal{L}(X;\Theta) = \theta^T X + b.$$

The bias term arises not only in linear models but in many other
classification models, including generative classification, multi-class
classification, and even regression. Evidently, one can always set $b$ to
zero to remove its effect, or simply set $b$ to a fixed constant, yet the
MED approach easily permits us to consider a distribution, $P(b)$, over $b$
and to tailor the solution by specifying a prior $P^0(b)$. Here, we consider
two possible choices for the prior $P^0(b)$ (although many others are
possible): the Gaussian prior and the non-informative prior.



7.1 Gaussian Bias Priors 

Consider the zero-mean Gaussian prior for $P^0(b)$ given by:

$$P^0(b) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{b^2}{2\sigma^2}}. \qquad (3.10)$$



This prior favors bias values that are close to zero and therefore a priori
assumes an even balance between the two binary classes in the decision
problem. If we have a prior belief that the class frequencies are slightly
skewed, we may introduce a mean into the above prior which would
favor one class over the other. The resulting potential term
$J_b(\lambda) = -\log Z_b(\lambda)$ is

$$J_b(\lambda) = -\frac{\sigma^2}{2} \left( \sum_t y_t \lambda_t \right)^2.$$

The variance $\sigma^2$ (or standard deviation $\sigma$) specifies how certain we
are that the classes are evenly balanced. In terms of the potential
function, it constrains with a quadratic penalty the balance between
Lagrange multipliers for the negative class and the positive class.
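The quadratic form of the bias potential can be checked by completing the
square. The worked step below is not spelled out in the text; it assumes the
bias enters each classification constraint through the term $\lambda_t y_t b$,
as it does for a discriminant of the form $\theta^T X + b$, and uses the
Gaussian prior of Equation 3.10.

```latex
\begin{align*}
Z_b(\lambda)
  &= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\,
     e^{-\frac{b^2}{2\sigma^2}}\, e^{\,b \sum_t \lambda_t y_t}\, db \\
  &= \exp\!\Big( \frac{\sigma^2}{2} \Big( \sum_t \lambda_t y_t \Big)^2 \Big)
     \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\,
     e^{-\frac{(b - \sigma^2 \sum_t \lambda_t y_t)^2}{2\sigma^2}}\, db
   = \exp\!\Big( \frac{\sigma^2}{2} \Big( \sum_t \lambda_t y_t \Big)^2 \Big), \\
J_b(\lambda)
  &= -\log Z_b(\lambda)
   = -\frac{\sigma^2}{2} \Big( \sum_t y_t \lambda_t \Big)^2 .
\end{align*}
```

The remaining integral is that of a shifted Gaussian density and equals one,
which leaves only the quadratic penalty on the imbalance between the two
classes' multipliers.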



7.2 Non-informative Bias Priors 

Evidently, a Gaussian prior will favor values of $b$ that are close to zero.
In the absence of any knowledge about the bias, it would be reasonable
to permit any scalar value for $b$ with equal preference. This will give
rise to a non-informative prior. This form of prior can be parameterized
as a Gaussian as in Equation 3.10, but with the variance approaching
infinity, i.e., $\sigma \to \infty$. This stretches out the Gaussian until it
starts to behave like a uniform distribution.

The resulting potential term will naturally be:

$$J_b(\lambda) = \lim_{\sigma \to \infty} -\frac{\sigma^2}{2} \left( \sum_t y_t \lambda_t \right)^2.$$

Since we are to maximize the potential terms, if $\sigma$ grows to infinity, the
above objective function will go to negative infinity unless
$\sum_t y_t \lambda_t$ is exactly zero. Therefore, the non-informative prior
generates the extra constraint (in addition to non-negativity) on the
Lagrange multipliers requiring $\sum_t y_t \lambda_t = 0$:

Lemma 1 If the bias prior $P^0(b)$ is set to a non-informative infinite
covariance Gaussian, the (non-negative) Lagrange multipliers in the MED
solution must also satisfy the equality constraint: $\sum_t y_t \lambda_t = 0$.

At this point, we have the priors and the computational machinery 
necessary for the MED formulation to give rise to support vector ma- 
chines. 



8. Support Vector Machines 

As previously discussed, a support vector machine can be cast in the 
regularization theory framework and is solvable as a convex program due 
to the linearity of its discriminant function: 

$$\mathcal{L}(X;\Theta) = \theta^T X + b.$$



One can also interpret the linear decision boundary generatively by
considering, for example, the log-likelihood ratio of two Gaussian
distributions (one per class) with equal covariance matrices.

$$\mathcal{L}(X;\Theta) = \log \frac{P(X|\theta_+)}{P(X|\theta_-)} + b.$$
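That the log-likelihood ratio of two equal-covariance Gaussians is linear in
$X$ can be confirmed numerically. In the sketch below the means, the shared
covariance and the helper `gaussian_logpdf` are all illustrative assumptions;
the implied linear parameters follow from expanding the two quadratic forms.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate Gaussian at a single point x."""
    d = len(mean)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(cov, diff))

mu_pos = np.array([1.0, 2.0])                  # class means (illustrative)
mu_neg = np.array([-1.0, 0.5])
cov = np.array([[2.0, 0.3], [0.3, 1.0]])       # covariance shared by both classes

# Implied linear form: theta = cov^{-1} (mu+ - mu-), plus a constant offset.
theta = np.linalg.solve(cov, mu_pos - mu_neg)
offset = -0.5 * (mu_pos @ np.linalg.solve(cov, mu_pos)
                 - mu_neg @ np.linalg.solve(cov, mu_neg))

rng = np.random.default_rng(1)
points = rng.normal(size=(5, 2))
log_ratio = np.array([gaussian_logpdf(x, mu_pos, cov) - gaussian_logpdf(x, mu_neg, cov)
                      for x in points])
linear = points @ theta + offset
```

The quadratic terms in $X$ cancel only because the covariances are equal;
with unequal covariances the boundary becomes quadratic.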



We first begin with a linear discriminant boundary since it has a more
efficient parameterization and with the choice of a simple prior will
exactly synthesize a support vector machine. In particular, if we choose
a Gaussian prior on the weights $\theta$, the MED formulation will produce
support vector machines:

Theorem 3.3 Assuming a discriminant of the form
$\mathcal{L}(X;\Theta) = \theta^T X + b$ and a factorized prior distribution
$P^0(\Theta, \gamma) = P^0(\theta) P^0(b) P^0(\gamma)$ where $P^0(\theta)$ is
$\mathcal{N}(0, I)$, $P^0(b)$ approaches a non-informative prior, and $P^0(\gamma)$
is given by $P^0(\gamma_t)$ as in Equation 3.9, the Lagrange multipliers
$\lambda$ are obtained by maximizing $J(\lambda)$ subject to $0 \le \lambda_t \le c$ and
$\sum_t \lambda_t y_t = 0$, where

$$J(\lambda) = \sum_t \left[ \lambda_t + \log(1 - \lambda_t/c) \right] - \frac{1}{2} \sum_{t,t'} \lambda_t \lambda_{t'} y_t y_{t'} (X_t^T X_{t'}).$$

The above $J(\lambda)$ objective function is strikingly similar to the SVM
dual optimization problem. The only difference between the two is that
Theorem 3.3 has an extra potential term $\log(1 - \lambda_t/c)$, which acts as
a barrier function preventing the $\lambda$ values from growing beyond $c$. In
an SVM, the Lagrange multipliers are clamped to be no greater than
$c$ explicitly as an extra constraint in the convex program. In both
formalisms, $c$ plays almost the same role by varying the degree of
regularization and upper bounding the Lagrange multipliers. Typically,
low $c$ values increase regularization, decrease the sensitivity of the
solution to classification errors, increase robustness to outliers and
permit non-separable classification problems. However, in an SVM, the
$c$-regularization parameter arises from an ad-hoc introduction of slack
variables to permit the SVM to handle non-separable data. If we let
$c$ grow to infinity, the potential term $\log(1 - \lambda_t/c)$ vanishes and MED
gives rise to exactly an SVM (for separable data). In practice, even for
finite $c$, the MED and SVM solutions are almost identical.
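The relation between the two dual objectives can be made concrete with a
short sketch. The function names, the toy data and the choice of multipliers
are hypothetical; the code only evaluates the two objectives at a feasible
point and does not solve either problem.

```python
import numpy as np

def svm_dual(lam, y, K):
    """Standard SVM dual objective: sum(lam) - 0.5 * sum lam lam' y y' K."""
    v = lam * y
    return lam.sum() - 0.5 * v @ K @ v

def med_dual(lam, y, K, c):
    """Theorem 3.3 objective: the SVM dual plus the barrier sum_t log(1 - lam_t / c)."""
    return svm_dual(lam, y, K) + np.sum(np.log1p(-lam / c))

X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = X @ X.T                                    # linear-kernel Gram matrix
lam = np.array([0.3, 0.1, 0.25, 0.15])         # satisfies sum(lam * y) = 0

# Gap between the two objectives for increasing c.
gaps = [svm_dual(lam, y, K) - med_dual(lam, y, K, c) for c in (1.0, 10.0, 1e3, 1e6)]
```

The gap is exactly the barrier term, and it shrinks towards zero as `c`
grows, which is the sense in which MED recovers the separable SVM.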

8.1 Single Axis SVM Optimization 

We can greatly simplify the support vector machine by avoiding the
non-informative prior on the bias. If we assume a Gaussian prior with
finite covariance, the equality constraint $\sum_t \lambda_t y_t = 0$ can be
omitted. The resulting convex program only requires non-negativity on the
Lagrange multipliers and the updated objective function becomes

$$J(\lambda) = \sum_t \left[ \lambda_t + \log(1 - \lambda_t/c) \right] - \frac{1}{2} \sum_{t,t'} \lambda_t \lambda_{t'} y_t y_{t'} X_t^T X_{t'}.$$

It is now possible to update a single Lagrange multiplier at a time in
an axis-parallel manner. In fact, the update for each axis is analytic
even with the MED logarithmic barrier function in the non-separable
case. The minimal working set in this case is of size one, while in the
SVM, updates to increase the objective must be done simultaneously on
at least two Lagrange multipliers at a time as in the Sequential Minimal
Optimization (SMO) technique proposed by Platt [147]. This gives the
MED implementation a simpler optimization procedure which leads to
gains in computational efficiency without any significant change from
the solution produced under non-informative priors.
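One way such an analytic single-axis update might look is sketched below; it
is a derivation made for this illustration, not the monograph's
implementation. Setting the derivative of $J$ with respect to one $\lambda_t$ to
zero gives a quadratic whose smaller root lies below $c$; that root is then
clipped at zero. The parameter `bias_var` is an assumption: it folds a
Gaussian bias potential of variance `bias_var` into the Gram matrix as a
constant offset.

```python
import numpy as np

def med_objective(lam, y, K, c):
    """J(lam) = sum_t [lam_t + log(1 - lam_t/c)] - 0.5 * sum lam lam' y y' K."""
    v = lam * y
    return lam.sum() + np.sum(np.log1p(-lam / c)) - 0.5 * v @ K @ v

def axis_update(t, lam, y, K, c):
    """Maximize J over lam[t] alone, keeping all other multipliers fixed."""
    k = K[t, t]                                       # assumed strictly positive
    others = (lam * y) @ K[t] - lam[t] * y[t] * k     # sum over t' != t
    a = 1.0 - y[t] * others
    # Stationarity: a - k*lam - 1/(c - lam) = 0, i.e. k*lam^2 - (a + k*c)*lam + (a*c - 1) = 0.
    root = ((a + k * c) - np.sqrt((a - k * c) ** 2 + 4.0 * k)) / (2.0 * k)
    return max(0.0, root)                             # enforce non-negativity

def fit(X, y, c=3.0, bias_var=1.0, sweeps=200):
    K = X @ X.T + bias_var                            # bias potential folded into the Gram matrix
    lam = np.zeros(len(y))
    history = [med_objective(lam, y, K, c)]
    for _ in range(sweeps):
        for t in range(len(y)):
            lam[t] = axis_update(t, lam, y, K, c)
        history.append(med_objective(lam, y, K, c))
    return lam, K, history

X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
lam, K, history = fit(X, y, c=3.0)
```

Because each step maximizes the concave objective exactly along one axis,
the recorded objective values never decrease, and no pairwise working set is
needed.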



8.2 Kernels 

The MED formulation for SVMs also readily extends to the kernel
case where nonlinearities (of the kernel type) can be folded in by an
implicit mapping to a higher dimensional space. The updated MED
objective function becomes:

$$J(\lambda) = \sum_t \left[ \lambda_t + \log(1 - \lambda_t/c) \right] - \frac{1}{2} \sum_{t,t'} \lambda_t \lambda_{t'} y_t y_{t'} K(X_t, X_{t'}).$$

Our standard inner products $X_t^T X_{t'}$ are replaced with a kernel function
of the vectors $K(X_t, X_{t'})$, as in the SVM literature. The MED
computations remain relatively unchanged, since (in the linear discriminant
case) all calculations only involve inner products of the input vectors.



9. Generative Models 

At this point we consider the use of generative models in the MED
framework. This fundamentally extends the regularization and SVM
discriminative frameworks to powerful Bayesian generative models. Herein
lies the strength of the MED technique as a bridge between two communities
with mutually beneficial tools. Consider a two class problem where
we have a generative model for each class, $P(X|\theta_+)$ and $P(X|\theta_-)$.
These two generative models can be directly combined to form a classifier by
considering their log-likelihood ratios

$$\mathcal{L}(X;\Theta) = \log \frac{P(X|\theta_+)}{P(X|\theta_-)} + b.$$



Here, the aggregate parameter set is $\Theta = \{\theta_+, \theta_-, b\}$,
which includes both generative models and a scalar bias. By simply changing
the discriminant function, the MED framework can be used to estimate
generative models and guarantee that the decision boundary will be optimal
in a classification setting. Naturally, the above discriminant function
is generally nonlinear and will give rise to non-convex constraints in a
standard regularization setting. However, in the MED framework, due
to the probabilistic solution $P(\Theta)$, the above discriminant functions
still behave as a convex program.
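
To make the discriminant concrete, here is a minimal Python sketch of the
log-likelihood ratio $\mathcal{L}(X;\Theta) = \log P(X|\theta_+) - \log P(X|\theta_-) + b$
for two one-dimensional Gaussians. It evaluates the discriminant at a single
parameter setting, whereas MED averages it over the distribution
$P(\Theta)$; the function names and the (mean, variance) parameterization
are assumptions made for illustration.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def discriminant(x, theta_pos, theta_neg, b):
    """L(X; Theta) = log P(X|theta+) - log P(X|theta-) + b."""
    return gaussian_logpdf(x, *theta_pos) - gaussian_logpdf(x, *theta_neg) + b

def predict(x, theta_pos, theta_neg, b):
    """Label +1 when the discriminant is non-negative, -1 otherwise."""
    return 1 if discriminant(x, theta_pos, theta_neg, b) >= 0.0 else -1
```

With equal variances the discriminant is linear in the input; with different
variances it becomes quadratic, which is why the constraints are generally
nonlinear in the parameters.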
Figure 3.5. Discriminative Generative Models. In (a) we show the standard
maximum likelihood estimation of two generative models from the data and the
poor classifier decision boundary they generate. In (b), MED moves the
generators slightly, such that they combine to form an accurate
classification boundary.




Estimating $P(\Theta)$ using MED will ultimately yield
$P(\theta_+, \theta_-, b)$, which can be used to specify the generative
models for the data, $P(X|\theta_+)$ and $P(X|\theta_-)$. These will be full
generative models that can be sampled from, integrated, conditioned, etc.
Yet, unlike in a direct Bayesian framework, these generative models will
also combine to form a high-performance discriminative classifier when
plugged into the $\mathcal{L}(X;\Theta)$ discriminant function. Figure 3.5(a)
depicts the estimation of a maximum likelihood generative model, while in
Figure 3.5(b) MED moves the generators for each class (ellipses) such that
the decision boundary creates good classification separation.

Once again, whether or not MED estimation is feasible hinges upon
our ability to compute the log-partition function $\log Z(\lambda)$. We will
show that it is possible to obtain the partition function analytically
whenever the generative models $P(X|\theta_+)$ and $P(X|\theta_-)$ are in
the exponential family.

9.1 Exponential Family Models 

We have argued that functions that can be efficiently solved within
the MED approach include log-likelihood ratios of the exponential family
of distributions. Can we compute the partition function efficiently to
actually implement this estimation? Recall the exponential family form
in Chapter 2 and its many properties. We rewrite exponential family
distributions in their natural parameterization as follows:

$$P(X|\theta) = \exp\left( A(X) + X^T\theta - K(\theta) \right)$$

where $K(\theta)$ is convex. In addition, each exponential family member has
a conjugate prior distribution, also in the exponential family, which we
write as

$$P(\theta|\chi) = \exp\left( \tilde{A}(\theta) + \theta^T\chi - \tilde{K}(\chi) \right)$$

where $\tilde{K}$ is also convex.

Whether or not a specific combination of a discriminant function and 
an associated prior is estimable within the MED framework depends 
on the computability of the partition function (i.e. the objective func- 
tion used for optimizing the Lagrange multipliers associated with the 
constraints). In general, these operations will require integrals over the 
associated parameters. In particular, recall the partition function
corresponding to the binary classification case. Consider the integral over
$\Theta$:

$$Z_\Theta(\lambda) = \int P^0(\Theta)\, e^{\sum_t \lambda_t y_t \mathcal{L}(X_t;\Theta)}\, d\Theta .$$

If we now separate out the parameters associated with the class-conditional
densities as well as the bias term (i.e. $\theta_+$, $\theta_-$, $b$) and
expand the discriminant function as a log-likelihood ratio, we obtain

$$Z_\Theta(\lambda) = \int P^0(\theta_+) P^0(\theta_-) P^0(b)\, e^{\sum_t \lambda_t y_t \left[ \log \frac{P(X_t|\theta_+)}{P(X_t|\theta_-)} + b \right]}\, d\Theta .$$

The above factorizes as $Z_\Theta = Z_{\theta_+} Z_{\theta_-} Z_b$. We can
now substitute the exponential family forms for the class-conditional
distributions and associated conjugate distributions for the priors. We
assume that the prior is defined by specifying a value for $\chi$. It
suffices to show that we can obtain $Z_{\theta_+}$ in closed form. The
derivation for $Z_{\theta_-}$ is identical. We will drop the “+” from
$Z_{\theta_+}$ for clarity. The problem is now reduced to evaluating

$$Z_\theta(\lambda) = \int e^{\tilde{A}(\theta) + \theta^T\chi - \tilde{K}(\chi)}\, e^{\sum_t \lambda_t y_t \left( A(X_t) + X_t^T\theta - K(\theta) \right)}\, d\theta .$$

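We have shown (see Lemma 1) that a non-informative prior over the bias
term $b$ leads to the constraint $\sum_t \lambda_t y_t = 0$. Making this
assumption, the terms involving $K(\theta)$ cancel and we get

$$\begin{aligned}
Z_\theta(\lambda) &= e^{-\tilde{K}(\chi) + \sum_t \lambda_t y_t A(X_t)} \times \int e^{\tilde{A}(\theta) + \theta^T \left( \chi + \sum_t \lambda_t y_t X_t \right)}\, d\theta \\
&= e^{-\tilde{K}(\chi) + \sum_t \lambda_t y_t A(X_t)} \times e^{\tilde{K}\left( \chi + \sum_t \lambda_t y_t X_t \right)} .
\end{aligned}$$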

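Here, the second step comes from a natural property of the exponential
family: the conjugate density integrates to one for any value of its
parameter, so the integral of $e^{\tilde{A}(\theta) + \theta^T\chi'}$ over
$\theta$ equals $e^{\tilde{K}(\chi')}$. The expressions for $A$, $\tilde{A}$,
$K$, $\tilde{K}$ are known for specific distributions and their conjugates
and can easily be plugged into the above, giving us a closed form
log-partition function

$$\log Z_\theta(\lambda) = \tilde{K}\Big( \chi + \sum_t \lambda_t y_t X_t \Big) + \sum_t \lambda_t y_t A(X_t) - \tilde{K}(\chi) .$$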

We see that we can compute the objective function $J(\lambda)$ for any
discriminant function arising from exponential family generative models.
In fact, integration doesn’t even need to be performed since we have
an analytic expression of our objective function in terms of the natural
parameterization for all exponential family distributions. Note that
the above objective function often bears a strong resemblance to the
evidence term in Bayesian inference (i.e. Bayesian integration), where
the Lagrange multipliers act as weights on the Bayesian inference. It is
straightforward at this point to perform the required optimization and
find the optimal setting of the Lagrange multipliers that maximize the
concave $J(\lambda)$.
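
The closed form can be checked numerically on a simple case. The sketch
below assumes a one-dimensional Gaussian likelihood with unit variance and
unknown mean $\theta$, so that $A(X) = -X^2/2 - \frac{1}{2}\log 2\pi$ and
$K(\theta) = \theta^2/2$, together with a Gaussian conjugate prior of
variance `S2`, for which $\tilde{A}(\theta) = -\theta^2/(2 S_2) - \frac{1}{2}\log(2\pi S_2)$
and $\tilde{K}(\chi) = S_2 \chi^2 / 2$. These choices, the constant `S2` and
the function names are illustrative assumptions. The multipliers in the
check satisfy $\sum_t \lambda_t y_t = 0$, as required for the cancellation
of $K(\theta)$.

```python
import math

S2 = 2.0  # prior variance, an illustrative value

def A(x):
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)

def K(theta):
    return 0.5 * theta * theta

def A_tilde(theta):
    return -theta * theta / (2.0 * S2) - 0.5 * math.log(2.0 * math.pi * S2)

def K_tilde(chi):
    return 0.5 * S2 * chi * chi

def log_Z_closed(lam, y, X, chi):
    """K~(chi + sum_t lam_t y_t X_t) + sum_t lam_t y_t A(X_t) - K~(chi)."""
    shift = sum(l * yt * x for l, yt, x in zip(lam, y, X))
    base = sum(l * yt * A(x) for l, yt, x in zip(lam, y, X))
    return K_tilde(chi + shift) + base - K_tilde(chi)

def log_Z_numeric(lam, y, X, chi, lo=-40.0, hi=40.0, n=80001):
    """Riemann-sum approximation of the integral over theta."""
    step = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        th = lo + i * step
        log_prior = A_tilde(th) + th * chi - K_tilde(chi)
        log_lik = sum(l * yt * (A(x) + x * th - K(th))
                      for l, yt, x in zip(lam, y, X))
        total += math.exp(log_prior + log_lik)
    return math.log(total * step)
```

On this example the two values agree to numerical precision, which
illustrates why no explicit integration is needed once the natural
parameterization is known.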

9.2 Empirical Bayes Priors 

We have seen that it is feasible to estimate generative models in the 
exponential family form under MED if we assume the priors are given 
by the conjugate distribution. However, the parameters of the conjugate 
priors are still not specified and we still have quite some flexibility in 
incorporating prior knowledge into the MED formulation. In the absence 
of prior knowledge, and whenever possible, we recommend the default 
prior to be either a conjugate non-informative prior or an Empirical 
Bayes prior. 

In other words, as a prior for $P^0(\Theta)$, or more specifically for
$P^0(\theta_+)$ and $P^0(\theta_-)$, we use the posterior distribution of
the parameters given the data that Bayesian inference generates. Consider a
two-class data set $\{(X_1, y_1), \ldots, (X_T, y_T)\}$ where the labels are
binary and of the form $y_t = \pm 1$. Thus, the inputs can be split into the
positive inputs $\{X_{1+}, \ldots, X_{T+}\}$ and the negative inputs
$\{X_{1-}, \ldots, X_{T-}\}$.

We now explicate the Bayesian inference procedure. To distinguish
the resulting densities from those that will be used in the MED
formulation, here we will use the symbol $\mathcal{P}$ for the Bayesian
distributions. In Bayesian inference, the posterior for the positive class
is estimated only from the positive input examples
$\{X_{1+}, \ldots, X_{T+}\}$:

$$\mathcal{P}(\theta_+) = \mathcal{P}(\theta_+ | \{X_{1+}, \ldots, X_{T+}\}) \propto P(\{X_{1+}, \ldots, X_{T+}\} | \theta_+)\, \mathcal{P}^0(\theta_+) \propto \prod_{t+} P(X_{t+}|\theta_+)\, \mathcal{P}^0(\theta_+) .$$

Similarly, the posterior for the negative class is estimated only from
the negative examples $\{X_{1-}, \ldots, X_{T-}\}$:

$$\mathcal{P}(\theta_-) \propto \prod_{t-} P(X_{t-}|\theta_-)\, \mathcal{P}^0(\theta_-) .$$

For this Bayesian generative estimate, a minimally informative prior
$\mathcal{P}^0(\theta_\pm)$ may be used. The result is a distribution that
is as good a generator as possible for the data set. However, we don’t want
just a good generator of the data, we also want a good discriminator. Thus,
we can use MED to satisfy the large-margin classification constraints.
Simultaneously, the solution should be as close as possible to the
generative model in terms of KL-divergence. One interesting choice for the
MED priors is the Bayesian posteriors themselves:

$$P^0(\theta_+) = \mathcal{P}(\theta_+), \qquad P^0(\theta_-) = \mathcal{P}(\theta_-) .$$

Figure 3.6 depicts the information projection solution MED will gener- 
ate from the Bayesian estimate. Effectively, we will try solving for the 
distribution over parameters that is as close as possible to the Bayesian 
estimate (which is often actually quite similar to the maximum likelihood 
estimate in the case of the exponential family) but that also satisfies the 
classification constraints. 
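
The construction of such an empirical Bayes prior can be sketched as
follows, assuming for simplicity a one-dimensional Gaussian likelihood with
unit variance and a broad Gaussian prior on each class mean. Each class
posterior is computed from that class's inputs only and would then serve as
the MED prior for the corresponding generative model. The function names
and the default prior values are hypothetical.

```python
def class_posterior(xs, prior_mean=0.0, prior_var=100.0):
    """Posterior (mean, variance) of a Gaussian mean given unit-variance data."""
    precision = 1.0 / prior_var + len(xs)
    mean = (prior_mean / prior_var + sum(xs)) / precision
    return mean, 1.0 / precision

def empirical_bayes_priors(X, y):
    """Split the inputs by label and fit one posterior per class."""
    positives = [x for x, label in zip(X, y) if label == 1]
    negatives = [x for x, label in zip(X, y) if label == -1]
    return class_posterior(positives), class_posterior(negatives)
```

The two returned distributions play the role of $P^0(\theta_+)$ and
$P^0(\theta_-)$; MED then moves away from them only as far as the
classification constraints require.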



Figure 3.6. Information Projection Operation of the Bayesian Generative Estimate.

The motivation here is that in the absence of any further discrimina- 
tive information, we should have as good a generator as possible. We 
now note a number of advantages for this type of empirical Bayes prior, 
including theoretical, conceptual and practical arguments. We suggest 
an empirical Bayes prior because it allows more flexible use of MED’s 
discriminative model as a generator whenever necessary. This may be 
the case when the discriminator has to cope with missing data or noise.
If we are in a prediction setting where some input variables are missing, 
we could reconstruct them (or integrate over them) by simply using the 
MED discriminative model as a surrogate for a generator distribution. 

When the data is sparse, a model may easily satisfy some given dis- 
crimination constraints and many aspects of the model could remain 
ambiguous. The empirical Bayesian prior provides a backup generative 
criterion, which further constrains the problem (albeit in ways not help- 
ful to the task) and therefore can help consistent estimation. We also 
obtain invariance when using an empirical Bayes prior which we would 
otherwise not have if we assume a fixed prior. For example, a fixed 
zero-mean Gaussian prior would produce different MED solutions if we 
translate the training data while an empirical Bayes prior would follow 
the translation of the data (with the Bayesian generative model) and 
consistently set up the same relative decision boundary. 

Furthermore, consistency is important in the (unrealistic) situation 
that the generative model we are using is exactly correct and perfectly 
matches the training data. In that case, the Bayesian solution is optimal 
and MED may stray from it unless we have an empirical Bayes prior, 
even if we obtain infinite training data. An interesting side note is that if
we use the standard margin prior distribution given by Equation 3.9, and
obtain an upper bound on the Lagrange multipliers (i.e. they are less
than $c$), then as $c \to 0$, the MED solution uses the Bayesian posterior,
while as $c$ increases, we reduce regularization (and outlier rejection) in
favor of perfect classification.

Finally, on a purely practical note, an empirical Bayes prior may pro- 
vide better numerical stability. A discriminative MED model could put 
little probability mass on the training data and return a very poor gener- 
ative configuration while still perfectly separating the data. This would 
be undesirable numerically, since we would get very small values for 
$P(X|\theta_+)$ and $P(X|\theta_-)$. During prediction, a new test point may cause
numerical accuracy problems if it is far from the probability mass in the 
MED discriminative solution. Therefore, whenever it does not result in a 
loss of discriminative power, one should maintain the generative aspects 
of the model. 

9.3 Full Covariance Gaussians

We now consider the case where the discriminant function
$\mathcal{L}(X;\Theta)$ corresponds to the log-likelihood ratio of two
Gaussians with different (and adjustable) covariance matrices. The
parameters $\Theta$ in this case are both the means and the covariances.
These generative models are within the exponential family so the previous
results hold. This leads us to choose the prior $P^0(\Theta)$ that is
conjugate to the Gaussian, namely the Normal-Inverse-Wishart as discussed in
Chapter 2. We shall use $\mathcal{N}$ as shorthand for the normal
distribution and $\mathcal{IW}$ as shorthand for the inverse-Wishart. This
choice of distributions permits us to obtain closed form integrals for the
partition function $Z(\lambda)$. We once again break down the parameters
into the two generative models and the bias. Therefore, we have
$P(\Theta) = P(\theta_+)P(\theta_-)P(b)$. More specifically, the
$\theta_\pm$ will also be broken down into the mean and covariance
components of the Gaussian. Therefore, we have
$P(\Theta) = P(\mu_+, \Sigma_+)P(\mu_-, \Sigma_-)P(b)$, which gives us a
density over means and covariances (this notation closely follows that of
[125]). We choose the prior distribution to have the following conjugate
form:



$$P^0(\theta_+) = \mathcal{N}(\mu_+ | m_+, \Sigma_+/k)\; \mathcal{IW}(\Sigma_+ | kV_+, k) .$$

Here, several parameters specify the prior, namely the scalar $k$, the
vector $m_+$, and the matrix $V_+$, which can be imputed manually. Also, one
may let $k \to 0$ to get a non-informative prior.

We used the MAP values for $k$, $m^0$ and $V^0$ from the class-specific data,
which corresponds to the posterior distribution over the parameters
given the data under a Bayesian inference procedure (i.e. an empirical
Bayes procedure as described in the previous section). Integrating over
the parameters, we get the partition function
$Z(\lambda) = Z_\gamma(\lambda) Z_+(\lambda) Z_-(\lambda)$.
For $Z_+(\lambda)$ we obtain



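$$Z_+(\lambda) \propto N_+^{-d/2}\, |\pi S_+|^{-N_+/2} \prod_{j=1}^{d} \Gamma\left( \frac{N_+ + 1 - j}{2} \right)$$

where we have defined

$$N_+ \triangleq \sum_t w_t, \qquad \bar{X}_+ \triangleq \sum_t \frac{w_t}{N_+} X_t, \qquad S_+ = \sum_t w_t X_t X_t^T - N_+ \bar{X}_+ \bar{X}_+^T .$$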



Here, $w_t$ is a scalar weight given by $w_t = u(y_t) + y_t \lambda_t$ for
$Z_+(\lambda)$. To solve for $Z_-(\lambda)$ we proceed in exactly the same
manner as above; however, the weights are set to
$w_t = u(-y_t) - y_t \lambda_t$. The function $u(\cdot)$ is merely
the step function where $u(x) = 1$ for $x > 0$ and $u(x) = 0$ otherwise.
Given $Z$, updating $\lambda$ is done by maximizing the corresponding
negative entropy $J(\lambda)$ subject to $0 \le \lambda_t \le c$ and
$\sum_t \lambda_t y_t = 0$, where:

$$J(\lambda) = \sum_t \left[ l_\alpha \lambda_t + \log(1 - \lambda_t/c) \right] - \log Z_+(\lambda) - \log Z_-(\lambda) .$$

The potential term above corresponds to integrating over the margin
with a margin prior $P^0(\gamma) \propto e^{-c(l_\alpha - \gamma)}$ with
$\gamma \le l_\alpha$. We pick $l_\alpha$ to be some $\alpha$-percentile of
the margins obtained under the standard MAP solution.
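
The weighted statistics that enter $Z_+(\lambda)$ and $Z_-(\lambda)$ are
simple to compute. The following Python sketch assumes the definitions
$N = \sum_t w_t$, $\bar{X} = \sum_t (w_t/N) X_t$ and
$S = \sum_t w_t X_t X_t^T - N \bar{X}\bar{X}^T$, with
$w_t = u(y_t) + y_t\lambda_t$ for the positive model and
$w_t = u(-y_t) - y_t\lambda_t$ for the negative one. The function name and
the `sign` argument are hypothetical conveniences introduced for the
example.

```python
def step(x):
    """Step function u(x): 1 for x > 0 and 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def weighted_stats(X, y, lam, sign=1):
    """Return (N, mean, scatter) for Z_+ (sign=+1) or Z_- (sign=-1)."""
    w = [step(sign * yt) + sign * yt * l for yt, l in zip(y, lam)]
    N = sum(w)
    d = len(X[0])
    mean = [sum(wt * x[i] for wt, x in zip(w, X)) / N for i in range(d)]
    scatter = [[sum(wt * x[i] * x[j] for wt, x in zip(w, X))
                - N * mean[i] * mean[j] for j in range(d)] for i in range(d)]
    return N, mean, scatter
```

With all multipliers at zero these reduce to the ordinary class count, mean
and scatter matrix. Whenever $\sum_t \lambda_t y_t = 0$ holds, $N$ stays
equal to the number of examples in the class, while non-zero multipliers
pull the mean and scatter toward or away from individual points of either
class.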

Optimal Lagrange multiplier values are then found via a simple constrained
gradient descent procedure. The resulting MRE solution (normalized by
the partition function $Z(\lambda)$) is a Normal-Wishart distribution itself
for each generative model with the final $\lambda$ values set by maximizing
$J(\lambda)$:

$$P(\theta_+) = \mathcal{N}(\mu_+ | \bar{X}_+, \Sigma_+/N_+)\; \mathcal{IW}(\Sigma_+ | S_+, N_+) .$$



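Predicting the labels for a data point $X$ under the final $P(\Theta)$
involves taking expectations of the discriminant function under a
Normal-Wishart. For the positive class, this expectation is:

$$E_{P(\theta_+)}\left[ \log P(X|\theta_+) \right] = \text{constant} - \frac{N_+}{2} (X - \bar{X}_+)^T S_+^{-1} (X - \bar{X}_+) .$$

The expectation over the negative class is similar. This gives us the
predicted label simply as

$$\hat{y} = \operatorname{sign} \int P(\Theta)\, \mathcal{L}(X;\Theta)\, d\Theta .$$

We can expand the above further to obtain

$$\begin{aligned}
\hat{y} &= \operatorname{sign}\, E_{P(\Theta)}\left[ \log \frac{P(X|\theta_+)}{P(X|\theta_-)} + b \right] \\
&= \operatorname{sign}\left( E_{P(\theta_+)}\left[ \log P(X|\theta_+) \right] - E_{P(\theta_-)}\left[ \log P(X|\theta_-) \right] + E_{P(b)}[b] \right) .
\end{aligned}$$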



Computing the expectation over the bias is avoided in the non-informative
case and the additive effect it has is merely estimated, as
in an SVM, via the Karush-Kuhn-Tucker conditions. In other words,
whenever the Lagrange multipliers are non-zero and less than the upper
bound $c$, the classification constraint inequalities for those data points
are achieved with equality.

Ultimately, through the log-ratio of these two Gaussian models we obtain
discriminative quadratic decision boundaries. These extend the linear
boundaries without (explicitly) resorting to kernels. Of course, kernels
may still be used in this formalism, effectively mapping the feature
space into a higher dimensional representation. However, in contrast to
linear discrimination, the covariance estimation in this framework allows
the model to adaptively modify the kernel.

Figure 3.7. Classification visualization for Gaussian discrimination. The
maximum likelihood solution is shown in (a) and is used to initialize the
MED optimization. An intermediate step in the MED solution is shown in (b)
and the final MED solution is shown in (c). In the first row we see the
decision boundary from two Gaussians with different covariances and means.
In the second row, we see the actual Gaussian probabilities for each class.
Note how the maximum likelihood solution places the Gaussians to match the
mean and covariance of each class and obtains a poor classifier as a result.
The MED solution, on the other hand, finds the configuration of the
Gaussians that obtains the largest margin decision boundary possible,
improving discrimination with the same generative model.

For visualization, we present the technique on a 2D set of training 
data in Figure 3.7. In Figure 3.7(a), the maximum likelihood technique 
is used to estimate a two Gaussian discrimination boundary (bias is 
estimated separately) which has the flexibility to achieve perfect classi- 
fication yet produces a classifier whose performance is equal to random 
guessing. Meanwhile, the maximum entropy discrimination technique 
places the Gaussians in the most discriminative configuration as shown 
in Figure 3.7(b) without requiring kernels or feature space manipula- 
tions. 

In the following, we show results using the minimum relative entropy
approach where the discriminant function $\mathcal{L}(X;\Theta)$ is the
log-ratio of Gaussians with variable covariance matrices on standard two
class classification problems (Leptograpsus Crabs and Breast Cancer
Wisconsin). Performance is compared to regular support vector machines,
maximum likelihood estimation and other methods.



Method                            Training Errors   Testing Errors
Neural Network (1)                       -                 3
Neural Network (2)                       -                 3
Linear Discriminant                      -                 8
Logistic Regression                      -                 4
MARS (degree = 1)                        -                 4
PP (4 ridge functions)                   -                 6
Gaussian Process (HMC)                   -                 3
Gaussian Process (MAP)                   -                 3
SVM - Linear                             5                 3
SVM - RBF σ = 0.3                        1                18
SVM - 3rd Order Polynomial               3                 6
Maximum Likelihood Gaussians             4                 7
MaxEnt Discrimination Gaussians          2                 3

Table 3.2. Leptograpsus Crabs Classification Results.



The Leptograpsus crabs data set was originally provided by Ripley 
[155] and further tested by Barber and Williams [8]. The objective is 
to classify the sex of the crabs from 5 scalar anatomical observations. 
The training set contains 80 examples (40 of each sex) and the test set 
includes 120 examples. 

The Gaussian based decision boundaries are compared in Table 3.2
against other models from [8]. The table shows that the maximum
entropy (or minimum relative entropy) criterion improves the Gaussian
discrimination performance to levels similar to the best alternative mod- 
els. The bias was estimated separately from training data for both the 
maximum likelihood Gaussian models and the maximum entropy dis- 
crimination case. In addition, we show the performance of a support 
vector machine (SVM) with linear, Gaussian RBF and polynomial ker- 
nels (using the Matlab SVM Toolbox provided by Steve Gunn). In this 
case, the linear SVM is limited in flexibility, while other kernels exhibit 
some over-fitting. 

Another data set which was tested was the Breast Cancer Wisconsin 
data where the two classes (malignant or benign) have to be computed 
from 9 numerical attributes describing the tumors (200 training cases 
and 169 test cases). The data was first presented by Wolberg [199]. 
We compare our results to those produced by Zhang [204] who used 
a nearest neighbor algorithm to achieve 93.7% accuracy. As can be
seen from Table 3.3, over-fitting prevents good performance from the 
kernel based SVMs and the top performer here is the maximum entropy 
discriminator with an accuracy of 95.3%. 



Method                            Training Errors   Testing Errors
Nearest Neighbor                         -                11
SVM - Linear                             8                10
SVM - RBF σ = 0.3                        0                11
SVM - 3rd Order Polynomial               1                13
Maximum Likelihood Gaussians            10                16
MaxEnt Discrimination Gaussians          3                 8

Table 3.3. Breast Cancer Classification Results.
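
The accuracies quoted in the text follow directly from the error counts in
the tables and the test set sizes (120 crab examples, 169 breast cancer
cases). A short Python sketch of the arithmetic, with a hypothetical helper
name:

```python
def accuracy(test_errors, n_test):
    """Fraction of test examples classified correctly."""
    return 1.0 - test_errors / n_test

# MaxEnt Discrimination Gaussians: 8 errors on the 169 breast cancer test cases
breast_cancer_med = accuracy(8, 169)
# MaxEnt Discrimination Gaussians: 3 errors on the 120 crab test examples
crabs_med = accuracy(3, 120)
```

The first value rounds to the 95.3% reported for the maximum entropy
discriminator.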



9.4 Multinomials 



Another popular exponential family model is the multinomial distribution.
We next consider the case where the discriminant function
$\mathcal{L}(X;\Theta)$ corresponds to the log-likelihood ratio of two
multinomials:

$$\mathcal{L}(X;\Theta) = \log \frac{P(X|\theta_+)}{P(X|\theta_-)} + b$$



where we have the generative models given by (if the $X$ vector is
considered as a set of counts):

$$P(X|\theta_+) = \binom{\sum_{k=1}^{K} X^k}{X^1 \cdots X^K} \prod_{k=1}^{K} \rho_k^{X^k} . \qquad (3.11)$$



In the above, we are using the superscript on $X^k$ to index the
dimensionality of the vector (the subscript will be used to index the
training set). The scalar term in the large parentheses is the multinomial
coefficient (the natural extension of the binomial coefficient from coin
tossing to die tossing). This term is unity if $X$ is zero everywhere except
for one unit entry. Otherwise, it simply scales the probability distribution
by a constant factor which can be rewritten as follows for more clarity
(the use of gamma functions permits us to also consider continuous $X$
vectors):

$$\binom{\sum_{k=1}^{K} X^k}{X^1 \cdots X^K} = \frac{\left( \sum_{k=1}^{K} X^k \right)!}{\prod_{k=1}^{K} X^k!} = \frac{\Gamma\left( 1 + \sum_{k=1}^{K} X^k \right)}{\prod_{k=1}^{K} \Gamma(1 + X^k)} .$$
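
In practice the gamma-function form is evaluated in the log domain to avoid
overflow. A minimal Python sketch using the standard library's
`math.lgamma` follows; the function name is a hypothetical choice.

```python
import math

def log_multinomial_coefficient(counts):
    """log Gamma(1 + sum_k X^k) - sum_k log Gamma(1 + X^k).

    Works for integer counts and, through the gamma function, for
    non-negative real-valued entries as well.
    """
    total = math.lgamma(1.0 + sum(counts))
    return total - sum(math.lgamma(1.0 + x) for x in counts)
```

For a vector with a single unit entry the result is zero, i.e. the
coefficient is unity, as noted in the text.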
The generative distribution in Equation 3.11 parameterizes the multinomial
with the $\rho$ vector of non-negative scalars that sum to unity, i.e.
$\sum_k \rho_k = 1$. The parameters $\Theta$ in this case are both the
$\rho$ for the positive class and the negative class as well as the bias
scalar $b$. These generative models are within the exponential family and
so the results of Section 9.1 should hold. Thus, the prior we choose must be
the conjugate to the multinomial, which is the Dirichlet distribution. This
choice of distributions permits us to obtain closed form integrals for the
partition function $Z(\lambda)$. Here, we shall once again break down the
parameters into the two generative models and the bias as before. We then
have $P(\Theta) = P(\theta_+)P(\theta_-)P(b)$. To distinguish them from the
$\rho$ for the positive class, we will denote the parameters for the
negative class with a separate symbol, written here as $\tilde{\rho}$. The
prior Dirichlet distribution has the form

$$P^0(\theta_+) = \frac{\Gamma\left( \sum_k \alpha_k \right)}{\prod_k \Gamma(\alpha_k)} \prod_k \rho_k^{\alpha_k - 1} .$$



We typically assume that $\alpha_k$ will be pre-specified manually (or given
by an empirical Bayes procedure) and will satisfy $\alpha_k > 1$. The core
computation involves computing the component of the log-partition function
that corresponds to the model (the computation for the bias and the
margins remain the same as all the previous cases). One component we
need is

$$Z_{\theta_+}(\lambda) Z_{\theta_-}(\lambda) = \int P^0(\theta_+) \int P^0(\theta_-)\, e^{\sum_t \lambda_t y_t \log \frac{P(X_t|\theta_+)}{P(X_t|\theta_-)}}\, d\theta_+ d\theta_- .$$

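It suffices to show how to compute $Z_{\theta_+}$:

$$\begin{aligned}
Z_{\theta_+} &= \int P^0(\theta_+)\, e^{\sum_t \lambda_t y_t \log P(X_t|\theta_+)}\, d\theta_+ \\
&= \int \frac{\Gamma\left( \sum_k \alpha_k \right)}{\prod_k \Gamma(\alpha_k)} \prod_k \rho_k^{\alpha_k - 1} \prod_t \prod_k \rho_k^{\lambda_t y_t X_t^k}\, d\rho \;\; e^{\sum_t \lambda_t y_t \log \binom{\sum_k X_t^k}{X_t^1 \cdots X_t^K}} \\
&= \int \frac{\Gamma\left( \sum_k \alpha_k \right)}{\prod_k \Gamma(\alpha_k)} \prod_k \rho_k^{\alpha_k - 1 + \sum_t \lambda_t y_t X_t^k}\, d\rho \;\; e^{\sum_t \lambda_t y_t \log \binom{\sum_k X_t^k}{X_t^1 \cdots X_t^K}} \\
&= \frac{\Gamma\left( \sum_k \alpha_k \right) \prod_k \Gamma\left( \alpha_k + \sum_t \lambda_t y_t X_t^k \right)}{\Gamma\left( \sum_k \alpha_k + \sum_{t,k} \lambda_t y_t X_t^k \right) \prod_k \Gamma(\alpha_k)} \;\; e^{\sum_t \lambda_t y_t \log \binom{\sum_k X_t^k}{X_t^1 \cdots X_t^K}} .
\end{aligned}$$

We can thus form our objective function and maximize it to obtain
the setting for the Lagrange multipliers $\lambda$ (subject to the
constraint $\sum_t \lambda_t y_t = 0$):

$$J(\lambda) = -\log Z_{\theta_+}(\lambda) - \log Z_{\theta_-}(\lambda) - \log Z_\gamma(\lambda) .$$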

The setting for the Lagrange multipliers permits us to exactly specify
the final MED solution distribution $P(\Theta)$ which is used to compute the
predictions for future classification:

$$\hat{y} = \operatorname{sign} \int P(\Theta)\, \mathcal{L}(X;\Theta)\, d\Theta .$$
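
The Dirichlet integral for the positive-class term can be evaluated with
log-gamma functions. The sketch below assumes the closed form
$\log Z_{\theta_+} = \sum_t \lambda_t y_t \log C(X_t) + \log\Gamma(\sum_k \alpha_k) - \sum_k \log\Gamma(\alpha_k) + \sum_k \log\Gamma(\alpha_k + s_k) - \log\Gamma(\sum_k (\alpha_k + s_k))$,
where $s_k = \sum_t \lambda_t y_t X_t^k$ and $C(X_t)$ is the multinomial
coefficient. The function names are hypothetical, and the explicit check
that every $\alpha_k + s_k$ stays positive is an added safeguard rather than
part of the derivation.

```python
import math

def log_multinomial_coefficient(counts):
    total = math.lgamma(1.0 + sum(counts))
    return total - sum(math.lgamma(1.0 + x) for x in counts)

def log_Z_theta_plus(lam, y, X, alpha):
    """Log of the positive-class partition term under a Dirichlet prior."""
    K = len(alpha)
    # s_k = sum_t lam_t y_t X_t^k shifts the Dirichlet exponents
    s = [sum(l * yt * x[k] for l, yt, x in zip(lam, y, X)) for k in range(K)]
    shifted = [a + sk for a, sk in zip(alpha, s)]
    if min(shifted) <= 0.0:
        raise ValueError("integral diverges: alpha_k + s_k must be positive")
    coeff = sum(l * yt * log_multinomial_coefficient(x)
                for l, yt, x in zip(lam, y, X))
    return (coeff
            + math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
            + sum(math.lgamma(p) for p in shifted) - math.lgamma(sum(shifted)))
```

The negative-class term has the same form with the sign of
$\lambda_t y_t$ reversed, since that model appears in the denominator of the
likelihood ratio.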

10. Generalization Guarantees 

We now present several arguments for MED in terms of generalization 
guarantees. While generative frameworks have guarantees (asymptotic 
and otherwise) on the goodness of fit of a distribution (i.e. Bayesian 
evidence score, Bayesian Information Criterion, Akaike Information Cri- 
terion, etc.), they seldom have guarantees on the generalization perfor- 
mance of the models in a classification or regression setting. Further- 
more, the guarantees may be distribution dependent which might be 
inappropriate if the true generative distribution of a data source is not 
perfectly known. Conversely, discriminative approaches that specifically 
target the classification or regression performance can have strong gen- 
eralization arguments as we move from training data to testing data. 
These may also be distribution independent. The MED framework, in 
its discriminative estimation approach, brings classification performance 
guarantees to generative models. There are a number of arguments 
we will make, including sparsity-based generalization, references to VC- 
dimension based generalization and PAC-Bayesian generalization. More 
recent work in generalization bounds based on Rademacher averages [10] 
as well as stability arguments [19] may also eventually prove useful for 
the MED framework. Although generalization bounds can be quite loose 
for small amounts of training data, they are better than no guarantees 
whatsoever. Furthermore, the shape of the generalization bounds has 
been found to be of practical use in discriminative learning problems. 
Finally, for large amounts of data, these bounds may eventually become 
reasonably tight. 

10.1 VC Dimension 

Due to the ability of the MED framework to subsume SVMs (exactly
generating the same equations in the separable case), it also admits
their generalization guarantees. These are of course the VC-dimension
(Vapnik-Chervonenkis) bounds on the expected risk, $R(\Theta)$, of a
classifier. Assuming we have a [0,1] loss function $l(X_t, y_t, \Theta)$ and
$T$ training examples, the empirical risk can be readily computed
[28, 187, 188]:

$$R_{emp}(\Theta) = \frac{1}{T} \sum_{t=1}^{T} l(X_t, y_t, \Theta) .$$

The true risk (for samples outside of the training set) is then bounded
above by the empirical risk plus a term that depends only on the size of
the training set, $T$, and the VC-dimension of the classifier, $h$. This
non-negative integer quantity measures the capacity of a classifier and is
independent of the data. The following bound holds with probability
$1 - \delta$:

$$R(\Theta) \le R_{emp}(\Theta) + \sqrt{ \frac{h\left( \log(2T/h) + 1 \right) - \log(\delta/4)}{T} } .$$



As in Chapter 2, the VC-dimension of a set of hyperplanes in $\mathbb{R}^D$
is $D + 1$. This does not directly motivate the use of large margin decision
boundaries. However, an SVM can be interpreted as a gap-tolerant
classifier instead of a pure hyperplane. The VC-dimension of such a
classifier is upper bounded by

$$h \le \min\left\{ \left\lceil \frac{d^2}{m^2} \right\rceil, D \right\} + 1$$



where m is the margin of the gap-tolerant classifier, d is the diameter of 
the sphere that encloses all the input data and D is the dimensionality 
of the space. Thus, we have a plausible argument for maximizing margin 
with a linear classifier. Although this does not translate immediately to 
nonlinear classifiers (if there is no direct kernel mapping back to linear 
hyperplanes), the motivation for large-margins in SVMs justifies using 
large margins in the MED formulation (namely with priors that put 
large probability mass on larger margin values). We now move on to
other formal arguments for MED generalization. 
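
To see how the margin enters the guarantee, the two bounds of this section
can be evaluated numerically. The Python sketch below assumes the forms
$h \le \min\{\lceil d^2/m^2 \rceil, D\} + 1$ and
$R \le R_{emp} + \sqrt{(h(\log(2T/h) + 1) - \log(\delta/4))/T}$; the
function names and the example values are illustrative only.

```python
import math

def gap_tolerant_vc_bound(diameter, margin, dim):
    """Upper bound on the VC-dimension of a gap-tolerant classifier."""
    return min(math.ceil(diameter ** 2 / margin ** 2), dim) + 1

def vc_risk_bound(emp_risk, h, T, delta):
    """Empirical risk plus the VC confidence term, valid w.p. 1 - delta."""
    confidence = (h * (math.log(2.0 * T / h) + 1.0) - math.log(delta / 4.0)) / T
    return emp_risk + math.sqrt(confidence)

# A larger margin lowers the capacity bound, which tightens the risk bound.
h_small_margin = gap_tolerant_vc_bound(diameter=10.0, margin=1.0, dim=100)
h_large_margin = gap_tolerant_vc_bound(diameter=10.0, margin=2.0, dim=100)
```

For a fixed training set size, the risk bound computed with
`h_large_margin` is smaller than the one computed with `h_small_margin`,
which is the sense in which the bound motivates maximizing the margin.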



10.2 Sparsity 

The MED solution involves a constraint-based optimization where a 
classification constraint is present over each training data point to be 
classified. Each constraint is represented by the Lagrange multiplier as- 
sociated with the given data point. In many cases, these constraints are 
likely to be redundant. This is apparent since classifying one data point 
correctly might automatically result in correct classification of several 
others. Therefore, the constraints involving some data points will be 
obviated by others and their corresponding Lagrange multipliers will go 
to zero. As in an SVM, points close to the margin (which have small
margin values) have a critical role in shaping the decision boundary and
generate non-zero Lagrange multipliers. These are the support vectors
in SVM terminology. Meanwhile, other points that are easily correctly
classified with a large margin will have zero Lagrange multipliers. Thus,
the MED solution only depends on a small subset of the training data
and will not change if the other data points were deleted. This gives
rise to a notion of sparsity leading to more generalization arguments.
One argument is that the generalization error (denoted $\epsilon_g$) is less
than the expected percentage (ratio) of non-zero Lagrange multipliers over
all Lagrange multipliers:

$$\epsilon_g \le E\left[ \frac{\sum_{t=1}^{T} \delta(\lambda_t > 0)}{T} \right] .$$

For $T$ data points, we simply count the number of non-zero Lagrange
multipliers (using the $\delta$ function which is zero for Lagrange
multipliers of value zero and unity for non-vanishing values). However, the
expectation is taken over arbitrary choices of the training set which means
that the upper bound on generalization error can only be approximated (using
cross-validation or other techniques as in [187, 78]). Alternatively, a
coarser and riskier approximation to the expectation can be done by
simply counting the number of remaining non-zero Lagrange multipliers
after maximizing $J(\lambda)$ on the training set in the MED solution.
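
The coarser approximation amounts to a single count over the optimized
multipliers. A minimal Python sketch follows; the function name and the
numerical tolerance `tol`, used to decide when a multiplier counts as zero,
are assumptions of the example.

```python
def support_fraction(lam, tol=1e-8):
    """Fraction of Lagrange multipliers that are numerically non-zero."""
    nonzero = sum(1 for l in lam if l > tol)
    return nonzero / len(lam)
```

This single-training-set value stands in for the expectation and should be
read as a rough indicator, not as the bound itself.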

10.3 PAC-Bayes Bounds 

An alternative to VC dimension arguments for generalization comes
from recent extensions to PAC bounds (probably approximately correct,
Valiant 1984) called PAC-Bayesian methods. PAC-Bayesian model
selection criteria developed by McAllester [118] and Langford [109] have 
given theoretical generalization arguments that directly motivate the 
MED approach (MED was actually developed independently of these 
generalization results). Essentially, PAC-Bayesian approaches allow the 
combination of a Bayesian integration of prior domain knowledge with 
PAC generalization guarantees without forcing the PAC framework to 
assume the truthfulness of the prior. We state the main results here 
but, for additional details, the reader should refer to the original works
[109, 118]. Effectively, the generalization guarantees are for model av- 
eraging where a stochastic model selection criterion is given in favor of 
a deterministic one. MED is a model averaging framework in that a 
distribution over models is computed (unlike, for instance, an SVM). 
Therefore, these new generalization results apply almost immediately. 
First, as in MED we assume a prior probability distribution $P^0(\Theta)$
over a possibly uncountable (continuous) model class. We also assume
our discriminant functions $\mathcal{L}(X;\Theta)$ are bounded real-valued
hypotheses (see Note 7). Given a set of $T$ training exemplars of the form
$(X_t, y_t)$ sampled from a distribution $D$, we would like to compute the
expected loss (i.e. the expected fraction of misclassifications). Recall
that, in MED, correct classification of the data is given by:

$$y_t \int P(\Theta)\, \mathcal{L}(X_t;\Theta)\, d\Theta \ge 0 ,$$

while incorrect classifications are of the form:

$$y_t \int P(\Theta)\, \mathcal{L}(X_t;\Theta)\, d\Theta \le 0 .$$

A more conservative empirical misclassification rate (i.e. one which
over-counts the number of errors) can be made by also counting as errors
those examples whose classification value falls below some positive margin
threshold $\gamma$:

$$y_t \int P(\Theta)\, \mathcal{L}(X_t;\Theta)\, d\Theta \le \gamma .$$

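If we compute the empirical number of misclassifications with this more
conservative technique based on the threshold $\gamma$, we can upper bound
the expected (standard) misclassification rate. The expected
misclassification rate has the following upper bound, which holds with
probability $1 - \delta$ (a bracketed inequality evaluates to one when it
holds and to zero otherwise):

$$E_D\left[ y \int P(\Theta)\, \mathcal{L}(X;\Theta)\, d\Theta \le 0 \right] \le \frac{1}{T} \sum_t \left[ y_t \int P(\Theta)\, \mathcal{L}(X_t;\Theta)\, d\Theta \le \gamma \right] + O\left( \sqrt{ \frac{\gamma^{-2}\, KL(P(\Theta) \| P^0(\Theta)) \ln T + \ln T + \ln \delta^{-1}}{T} } \right) .$$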



Ideally, we would like to minimize the expected risk of the classifier 
on future data (left hand side). Clearly, the bound above motivates 
forming a classifier that satisfies the empirical classification constraints 
(encapsulated by the first term on the right hand side), while minimizing 
divergence to the prior distribution (the second term on the right hand 
side). We also note that increasing the margin threshold is also useful 
for minimizing the expected risk. These criteria are directly addressed 
by the MED framework, which strongly agrees with this theoretical mo- 
tivation. Furthermore, increasing the number of training examples will 
make the bound tighter independently of the distribution of the data.
Another point of contact is that [118] argues that the optimal poste- 
rior according to these types of bounds is, as in the maximum entropy 
discrimination solution, the Gibbs distribution. 
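
As a rough illustration of how the two sides of such a bound trade off, the following sketch computes the conservative margin-based error count and a complexity term. It assumes the complexity term has the square-root form $\sqrt{(\gamma^{-2} KL \ln T + \ln T + \ln\delta^{-1})/T}$ and sets the constant hidden by $O(\cdot)$ to 1, so only relative comparisons are meaningful. The margin values and the function names are hypothetical.

```python
import math

def margin_error_rate(margins, gamma):
    """Fraction of examples whose averaged value y_t * E[L(X_t)] falls below gamma."""
    return sum(1 for m in margins if m < gamma) / len(margins)

def complexity_term(kl, gamma, T, delta):
    """Square-root term of the bound; the constant inside O(.) is set to 1 (an assumption)."""
    inside = kl * math.log(T) / gamma ** 2 + math.log(T) + math.log(1.0 / delta)
    return math.sqrt(inside / T)

# Hypothetical values of y_t * integral P(Theta) L(X_t; Theta) dTheta for 8 examples
margins = [1.3, 0.9, 0.2, -0.4, 1.1, 0.6, 1.8, 0.05]

plain_rate = margin_error_rate(margins, 0.0)         # ordinary training error
conservative_rate = margin_error_rate(margins, 0.5)  # counts small margins as errors too
print(plain_rate, conservative_rate)

# The complexity term shrinks as T grows, for a fixed KL divergence and margin
print(complexity_term(kl=2.0, gamma=0.5, T=1000, delta=0.05))
print(complexity_term(kl=2.0, gamma=0.5, T=100000, delta=0.05))
```

Raising `gamma` increases the empirical count but decreases the complexity term, which is the trade-off the bound expresses.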

Notes

1 At the risk of misquoting what Ockham truly intended to say, we shall 
use this quote to motivate the sparsity which arises from a constraint- 
based discriminative learner such as the maximum entropy discrimi- 
nation formalism. 

2 It should be noted that regularization theory is not limited to margin- 
based concepts. In general, the penalty function or stabilizer terms 
may depend on many other regularization criteria through a wide 
area of possible norms and semi-norms. One interpretation of regularization
theory is to regard it as an approach to solving inverse
problems. It spans applications from spline-fitting to pattern recog- 
nition and employs many sophisticated mathematical constructs such 
as reproducing kernel Hilbert spaces [48]. 

3 In practice, other (possibly parametric) restrictions may arise on $P(\Theta)$
that prevent us from using arbitrary delta functions in this manner. 

4 At this point we have assumed that the margins $\gamma_t$ and their loss
functions are held fixed (these are typically set to $\gamma_t = 1\ \forall t$). This
assumption will be relaxed subsequently. 

5 Equality constraints in the primal problem would generate Lagrange 
multipliers that are arbitrary scalars in $(-\infty, \infty)$.

6 Thanks to David Gondek for pointing out a mistake in the original 
version of one of the derivations. 

7 The generalization guarantees were originally proposed for averaging 
binary discriminant functions, not real ones, but can be extended in a 
straightforward manner. One may construct an MED classifier where 
the discriminant function is, for instance, sigmoidal or binary, and 
then satisfy the requirements for these bounds to hold. Alternatively, 
a trivial extension is to find a bound by considering a maximal sphere 
around all the data which implicitly provides limits on the range of 
the discriminant function. This then permits a scaled version of the 
generalization bound. 




Chapter 4 



EXTENSIONS TO MED 



Each problem that I solved became a rule which served afterwards to 

solve other problems. 

René Descartes, 1596-1650



In the previous chapter we saw how the MED approach to learning 
can combine generative modeling with discriminative methods, such as 
SVMs. In this chapter we explore extensions of this framework spanning 
a wide variety of learning scenarios. One resounding theme is the intro- 
duction of further (possibly intermediate) variables in the discriminant 
function $\mathcal{L}(X;\Theta)$, and solving for an augmented distribution $P(\Theta, \ldots)$
involving these new terms (Figure 4.1). The resulting partition func- 
tion typically involves more integrals, but as long as it is analytic, the 
number of Lagrange multipliers and the complexity of the optimization 
will remain basically unchanged, as when we introduced slack variables 
in Section 3. Once again, we note that as we add more distributions to 
the prior, we must be careful to balance their competing goals (i.e. their 
variances) evenly so that we still derive meaningful information from 
each component of the aggregate prior (the model prior, the margin 
prior, and the many further priors we will introduce shortly). 

Figure 4.2 depicts the many different scenarios that MED can han- 
dle. Some extensions such as multi-class classification can be treated as 
multiple binary classification constraints [80] or through error-correcting 
codes [41]. In this chapter we explicate the case where the labels are no 
longer discrete but continuous, as in regression. Once again (as in binary 
classification), we find that MED subsumes SVM regression. Following 
that, we discuss structure learning (as opposed to parameter estimation),
and in particular, feature selection. This leads to the more general
problems of kernel selection and meta-learning, where multi-task
SVM models share a common feature or kernel selection configuration. 
We also discuss the use of partially labeled examples and transduction 
(for both classification and regression). Finally, we arrive at a very im- 
portant generalization which requires special treatment on its own: the 
extension to mixture models (i.e. mixtures of the exponential family) 
and latent modeling (discussed in Chapter 5).

1. Multiclass Classification 

There are several different approaches to extending binary classifi- 
cation to multi-class problems (for example, error correcting output 
codes [41]), each with its own benefits and drawbacks. 

It is straightforward to perform multi-class discriminative density esti- 
mation by adding extra classification constraints. For T input points, the 
binary case merely requires $T$ inequalities of the form $y_t\,\mathcal{L}(X_t;\Theta) - \gamma_t \ge 0$.
In a multi-class setting, constraints are needed based on the pairwise
log-likelihood ratios of the generative model of the correct class and
that of all the other classes. In other words, in a three-class problem
($A$, $B$, $C$) with three models ($\theta_A$, $\theta_B$, $\theta_C$), if $y_t = A$, the log-likelihood of
model $\theta_A$ must dominate. This leads to the following two classification
constraints:

$$\int P(\Theta,\gamma)\left[\log\frac{P(X_t|\theta_A)}{P(X_t|\theta_B)} + b_{AB} - \gamma_t\right] d\Theta\,d\gamma \ge 0,$$

$$\int P(\Theta,\gamma)\left[\log\frac{P(X_t|\theta_A)}{P(X_t|\theta_C)} + b_{AC} - \gamma_t\right] d\Theta\,d\gamma \ge 0.$$

Figure 4.2. Various extensions to binary classification. Panels: (a) Binary Classification, (b) Multiclass Classification, (c) Regression, (d) Feature Selection, (f) Anomaly Detection.



More generally, for $m$ class-conditional probability models, we have
$P(X|\theta_y)$ for $y = 1, \ldots, m$ as well as the class frequencies $p_y$ which are
just non-negative scalars. Each discriminant function is then a pairwise
comparison:

$$\mathcal{L}_{y,y'}(X|\Theta) = \log\frac{P(X|\theta_y)}{P(X|\theta_{y'})} + \log\frac{p_y}{p_{y'}}$$

where the parameters include all individual class-conditional models and
frequencies, $\Theta = \{\theta_1, \ldots, \theta_m, p_1, \ldots, p_m\}$. Hence, for each data point,
we have a total of $m - 1$ constraints comparing the class-conditional
probabilities of the correct class to all the incorrect classes:

$$\int P(\Theta,\gamma)\left[\mathcal{L}_{y_t,y}(X_t|\Theta) - \gamma_t\right] d\Theta\,d\gamma \ge 0 \quad \forall y \ne y_t.$$
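
To make the pairwise construction concrete, the sketch below evaluates the $m - 1$ discriminants for a single data point. It is only illustrative: it uses identity-covariance Gaussians as the class-conditional models and evaluates the discriminants at fixed parameter values rather than in expectation under $P(\Theta,\gamma)$. The function names, means and class frequencies are hypothetical.

```python
import numpy as np

def log_gaussian(x, mean):
    """Log-density of an identity-covariance Gaussian (an illustrative model choice)."""
    d = x.shape[0]
    return -0.5 * np.sum((x - mean) ** 2) - 0.5 * d * np.log(2.0 * np.pi)

def pairwise_discriminant(x, y, y_other, means, priors):
    """log P(x|theta_y) / P(x|theta_y') + log p_y / p_y'."""
    return (log_gaussian(x, means[y]) - log_gaussian(x, means[y_other])
            + np.log(priors[y]) - np.log(priors[y_other]))

def constraint_values(x, y_true, means, priors):
    """The m - 1 discriminant values comparing the correct class to each other class."""
    return {y: pairwise_discriminant(x, y_true, y, means, priors)
            for y in range(len(means)) if y != y_true}

means = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
priors = [0.5, 0.25, 0.25]
x = np.array([0.2, -0.1])          # a point that belongs to class 0

vals = constraint_values(x, 0, means, priors)
print(vals)                        # two values, one per incorrect class
```

Each value would have to exceed its margin for the corresponding constraint to hold at these parameter settings.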

2. Regression

The MED formalism is not restricted to classification. We now present
its extension to regression (or function approximation) using the approach
and nomenclature of [172]. We impose two-sided constraints
on the output, forming a so-called $\epsilon$-tube around the discriminant. An
$\epsilon$-tube is often used in SVM regression and corresponds to a region of
insensitivity in the loss function. It only penalizes regression errors which
deviate by more than $\epsilon$ from the desired output values. Suppose that
training input examples $X_1, \ldots, X_T$ are given with corresponding
continuous scalar outputs $y_1, \ldots, y_T$. We wish to solve for a distribution over
the parameters of a discriminative regression function as well as margin
variables:



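THEOREM 4.1 The maximum entropy discrimination regression problem
can be cast as follows:

Find $P(\Theta,\gamma)$ that minimizes $KL(P\|P^0)$ subject to the constraints:

$$\int P(\Theta,\gamma)\left[y_t - \mathcal{L}(X_t;\Theta) + \gamma_t\right] d\Theta\,d\gamma \ge 0, \quad t = 1 \ldots T$$

$$\int P(\Theta,\gamma)\left[\gamma'_t - y_t + \mathcal{L}(X_t;\Theta)\right] d\Theta\,d\gamma \ge 0, \quad t = 1 \ldots T$$

where $\mathcal{L}(X_t;\Theta)$ is a discriminant function and $P^0$ is a prior distribution
over models and margins. The decision rule is given by $\hat{y} =
\int P(\Theta)\,\mathcal{L}(X;\Theta)\,d\Theta$ and the solution of the MED problem is

$$P(\Theta,\gamma) = \frac{1}{Z(\lambda)}\,P^0(\Theta,\gamma)\; e^{\sum_t \lambda_t\left[y_t - \mathcal{L}(X_t;\Theta) + \gamma_t\right]}\; e^{\sum_t \lambda'_t\left[\gamma'_t - y_t + \mathcal{L}(X_t;\Theta)\right]}$$

where the concave objective function is $J(\lambda) = -\log Z(\lambda)$.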



Typically, we have the following prior for $\gamma$, which differs from the
classification case in the additive (versus multiplicative) role of the output
$y_t$ and the presence of two-sided constraints:

$$P^0(\gamma_t) \propto \begin{cases} 1 & \text{if } 0 \le \gamma_t \le \epsilon \\ e^{c(\epsilon-\gamma_t)} & \text{if } \gamma_t > \epsilon \end{cases} \qquad (4.1)$$

Figure 4.3. Margin prior distributions (top) and associated potential functions (bottom). Panels: (a), (b), (c).



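Integrating, we obtain:

$$\begin{aligned}
\log Z_{\gamma_t}(\lambda_t) &= \log\left(\int_0^{\epsilon} e^{\lambda_t\gamma_t}\,d\gamma_t + \int_{\epsilon}^{\infty} e^{c(\epsilon-\gamma_t)}\,e^{\lambda_t\gamma_t}\,d\gamma_t\right) \\
&= \log\left(\frac{e^{\lambda_t\epsilon}}{\lambda_t} - \frac{1}{\lambda_t} + \frac{e^{\lambda_t\epsilon}}{c-\lambda_t}\right) \\
&= \epsilon\lambda_t - \log(\lambda_t) + \log\left(1 - e^{-\lambda_t\epsilon} + \frac{\lambda_t}{c-\lambda_t}\right).
\end{aligned}$$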



Figure 4.3 shows the above prior and its associated penalty terms under
different settings of $c$ and $\epsilon$. Varying $\epsilon$ modifies the thickness of the
$\epsilon$-tube around the function. Meanwhile, $c$ controls the robustness to
outliers by tolerating violations of the $\epsilon$-tube.

This margin prior tends to produce a regressor which is insensitive to
errors smaller than $\epsilon$ and thereafter penalizes errors by an almost linear
loss ($c$ controlling the steepness of the linear loss).
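
The closed form for the margin partition function can be checked numerically. The sketch below assumes the prior of Equation 4.1 (flat on $[0,\epsilon]$, exponentially decaying with rate $c$ beyond $\epsilon$) and a multiplier $0 < \lambda_t < c$, and compares the closed-form expression $\epsilon\lambda_t - \log\lambda_t + \log(1 - e^{-\lambda_t\epsilon} + \lambda_t/(c-\lambda_t))$ with a trapezoid-rule integral. The function names and the particular values of `lam`, `eps` and `c` are hypothetical.

```python
import numpy as np

def log_z_margin_closed_form(lam, eps, c):
    """eps*lam - log(lam) + log(1 - exp(-lam*eps) + lam/(c - lam)), for 0 < lam < c."""
    return eps * lam - np.log(lam) + np.log(1.0 - np.exp(-lam * eps) + lam / (c - lam))

def log_z_margin_numeric(lam, eps, c, upper=60.0, n=600001):
    """Trapezoid-rule integral of prior(gamma) * exp(lam * gamma) over [0, upper]."""
    g = np.linspace(0.0, upper, n)
    # Work with the exponent directly to avoid overflow in the tail
    exponent = np.where(g <= eps, lam * g, c * eps - (c - lam) * g)
    f = np.exp(exponent)
    integral = np.sum((f[1:] + f[:-1]) * np.diff(g)) / 2.0
    return np.log(integral)

lam, eps, c = 1.0, 0.3, 5.0
closed = log_z_margin_closed_form(lam, eps, c)
numeric = log_z_margin_numeric(lam, eps, c)
print(closed, numeric)
```

The truncation at `upper` is harmless here because the integrand decays like $e^{-(c-\lambda_t)\gamma_t}$ beyond $\epsilon$.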



2.1 SVM Regression 

Assuming that $\mathcal{L}$ is a linear discriminant (or linear after a kernel
mapping), the MED formulation returns the same objective function as 
SVM regression [172]: 
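THEOREM 4.2 Let $\mathcal{L}$ be a linear discriminant function $\mathcal{L}(X;\Theta) = \theta^T X + b$
and $P^0(\Theta,\gamma) = P^0(\theta)\,P^0(b)\,P^0(\gamma)$ be a prior with $P^0(\theta) = \mathcal{N}(\theta; 0, I)$,
$P^0(b)$ approaching a non-informative prior, and $P^0(\gamma)$ as given by Equation 4.1.
The Lagrange multipliers $\lambda$ are obtained by maximizing $J(\lambda)$
subject to $0 \le \lambda_t \le c$, $0 \le \lambda'_t \le c$ and $\sum_t \lambda_t = \sum_t \lambda'_t$, where

$$\begin{aligned}
J(\lambda) ={}& \sum_t y_t(\lambda'_t - \lambda_t) - \epsilon\sum_t(\lambda_t + \lambda'_t) + \sum_t\log(\lambda_t) + \sum_t\log(\lambda'_t) \\
& - \sum_t\log\left(1 - e^{-\lambda_t\epsilon} + \frac{\lambda_t}{c-\lambda_t}\right) - \sum_t\log\left(1 - e^{-\lambda'_t\epsilon} + \frac{\lambda'_t}{c-\lambda'_t}\right) \\
& - \frac{1}{2}\sum_{t,t'}(\lambda_t-\lambda'_t)(\lambda_{t'}-\lambda'_{t'})\,(X_t^T X_{t'}).
\end{aligned}$$

Figure 4.4. MED approximation to the sinc function: noise-free case (left) and with
Gaussian noise (right).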



We see that as $c \to \infty$, the objective becomes similar to the one in
SVM regression. There are some additional penalty functions (all the 
logarithmic terms) which can be considered as barrier functions in the 
optimization to maintain the constraints. 

As an illustration, we approximate the sinc function, a popular example
in the SVM literature. Here we sampled 100 points from $\mathrm{sinc}(x) =
|x|^{-1}\sin|x|$ in the interval $[-10, 10]$. We also considered a noisy version
of the sinc function where Gaussian additive noise of standard deviation
0.2 was added to the output. The result, shown in Figure 4.4, is similar
to the function approximation we would get from SVM regression. The
kernel applied here was an 8th order polynomial.
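
The setup of this experiment can be sketched as follows. The code only generates the data and evaluates the MED regression objective at one feasible point; it does not perform the maximization. It assumes the objective $J(\lambda)$ consists of the linear terms in $y_t$ and $\epsilon$, the logarithmic barrier terms from the margin prior, and a quadratic term $-\frac{1}{2}\sum_{t,t'}(\lambda_t-\lambda'_t)(\lambda_{t'}-\lambda'_{t'})K(X_t,X_{t'})$ with the inner product replaced by a kernel. The inhomogeneous form of the polynomial kernel, the rescaling of the inputs to $[-1,1]$, and the values of `eps` and `c` are all assumptions made for illustration.

```python
import numpy as np

def sinc(x):
    """sinc(x) = sin|x| / |x|, with the limiting value 1 at x = 0."""
    ax = np.abs(x)
    safe = np.where(ax > 0, ax, 1.0)
    return np.where(ax > 0, np.sin(ax) / safe, 1.0)

def poly_kernel(a, b, degree=8):
    """Inhomogeneous polynomial kernel (1 + a*b)^degree on scalar inputs (assumed form)."""
    return (1.0 + np.outer(a, b)) ** degree

def med_regression_objective(lam, lam_p, y, K, eps, c):
    """J(lambda) for MED regression, with inner products replaced by the Gram matrix K."""
    d = lam - lam_p

    def barrier(l):
        # log(l) - log(1 - exp(-l*eps) + l/(c - l)), from the margin prior
        return np.log(l) - np.log(1.0 - np.exp(-l * eps) + l / (c - l))

    return (np.dot(y, lam_p - lam) - eps * np.sum(lam + lam_p)
            + np.sum(barrier(lam)) + np.sum(barrier(lam_p))
            - 0.5 * d @ K @ d)

rng = np.random.default_rng(0)
x_raw = np.linspace(-10.0, 10.0, 100)               # 100 points in [-10, 10]
y_clean = sinc(x_raw)
y_noisy = y_clean + rng.normal(0.0, 0.2, size=x_raw.shape)

K = poly_kernel(x_raw / 10.0, x_raw / 10.0)         # inputs rescaled to [-1, 1] (an assumption)
lam = np.full(100, 0.5)
lam_p = np.full(100, 0.5)                           # satisfies sum(lam) == sum(lam_p)
J = med_regression_objective(lam, lam_p, y_noisy, K, eps=0.1, c=10.0)
print(J)
```

At this symmetric starting point the data-dependent terms cancel, so any optimizer would have to move `lam` and `lam_p` apart to fit the targets.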

2.2 Generative Model Regression

As we did for classification, now consider MED regression with nonlinear
models. The regularization and $\epsilon$-tube properties of the SVM
approach can be readily applied to the estimation of generative models
in a regression setting. The model can be any member of the exponential
family or their mixtures, as we shall see in the next chapter. We begin
by modifying the discriminant function $\mathcal{L}(X;\Theta)$ from its usual linear
form.

Consider a two class problem where we have a generative model for
each class, $P(X|\theta_+)$ and $P(X|\theta_-)$. For example, each could be a
Gaussian distribution, a mixture of Gaussians, or even a complex structured
model, such as a hidden Markov model. To form a regressor, we directly
combine two generative models by using their log-likelihood
ratio as a discriminant function

$$\mathcal{L}(X;\Theta) = \log\frac{P(X|\theta_+)}{P(X|\theta_-)} + b.$$



Here the aggregate parameter set is $\Theta = \{\theta_+, \theta_-, b\}$ which includes both
generative models and a scalar bias. Thus, by merely changing the dis- 
criminant function, the MED framework can be used to estimate gen- 
erative models that form a regression function. If the two generators 
are Gaussians with equal covariance, the regression function will pro- 
duce a linear regression. However, the above discriminant function is 
generally nonlinear. In a traditional regularization setting it will give 
rise to non-convex constraints. However, in the MED framework, due 
to the probabilistic nature of the solution $P(\Theta)$, the above discriminant
functions will still produce a convex program with a unique solution. 
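
The remark that two equal-covariance Gaussians yield a linear regression function can be verified directly: the quadratic terms in $X$ cancel in the log-likelihood ratio. The sketch below is a small numerical check of that identity. The function names and the particular means and covariance are hypothetical.

```python
import numpy as np

def log_ratio_discriminant(x, mu_pos, mu_neg, cov, b=0.0):
    """log N(x; mu_pos, cov) - log N(x; mu_neg, cov) + b (normalizers cancel)."""
    prec = np.linalg.inv(cov)
    dp, dn = x - mu_pos, x - mu_neg
    return -0.5 * dp @ prec @ dp + 0.5 * dn @ prec @ dn + b

def equivalent_linear_form(mu_pos, mu_neg, cov, b=0.0):
    """Weights w and offset w0 such that the discriminant equals w @ x + w0."""
    prec = np.linalg.inv(cov)
    w = prec @ (mu_pos - mu_neg)
    w0 = -0.5 * (mu_pos @ prec @ mu_pos - mu_neg @ prec @ mu_neg) + b
    return w, w0

rng = np.random.default_rng(0)
mu_pos = np.array([1.0, 0.0, -1.0])
mu_neg = np.array([-0.5, 2.0, 0.0])
A = rng.normal(size=(3, 3))
cov = A @ A.T + np.eye(3)            # a shared positive-definite covariance

w, w0 = equivalent_linear_form(mu_pos, mu_neg, cov, b=0.3)
x = rng.normal(size=3)
direct = log_ratio_discriminant(x, mu_pos, mu_neg, cov, b=0.3)
linear = w @ x + w0
print(direct, linear)
```

With unequal covariances the quadratic terms no longer cancel and the discriminant becomes nonlinear in $X$.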



3. Feature Selection and Structure Learning 

The MED framework is not limited to estimating distributions over 
continuous parameters such as $\Theta$. We can also use it for structure learning
by solving for a distribution over discrete parameters. One form
of structure learning is feature selection. The feature selection problem 
can be cast as finding the structure of a graphical model (as in [39]) or 
identifying a set of components of the input examples that are relevant 
for a classification task. More generally, feature selection can be viewed 
as a problem of setting discrete structural parameters associated with a 
specific classification or regression method. We will use feature selection 
in the MED framework to ignore components of the input space (i.e. the 
$X_t$ vectors) that are not relevant to the given classification or regression
task. Not only does feature selection appreciably reduce the run time 
of many algorithms, it often also improves generalization performance, 
both in classification and regression [103]. The omission of certain input
dimensions corresponds to a notion of sparsity in the input dimensions (in addition to
the sparsity from the few support vectors and Lagrange multipliers that
emerge in regular support vector machines) [86, 195]. This is often critical
when the input space has high dimensionality with many irrelevant
features and the data set is small.

We first cast feature selection in MED as a feature weighting scheme 
to permit a probabilistic approach. Each feature or structural parameter 
is given a probability value. The feature selection process then simulta- 
neously estimates the most discriminative probability distribution over 
the structural parameters and the most discriminative parameter mod- 
els. Irrelevant features will eventually receive extremely low selection 
probabilities. Since the feature selection process is performed jointly 
and discriminatively together with model estimation, and both specif- 
ically optimize a classification or regression criterion, feature selection 
will usually improve results over, for example, an SVM (up to a point 
where we start removing too many features). 

3.1 Feature Selection in Classification 

The MED formulation can be extended to feature selection by aug- 
menting the distribution over models (and margins, bias, etc.) to a dis- 
tribution over models and feature selection switches. This augmentation 
paradigm preserves the solvability of the MED projection under many
conditions. We will now consider augmenting a linear classifier (such 
as an SVM) with feature selection. We first introduce extra parameters 
into our linear discriminant function:

$$\mathcal{L}(X;\Theta) = \sum_{i=1}^{n} \theta_i s_i X_i + \theta_0.$$

Here the familiar $\theta_1, \ldots, \theta_n$ correspond to the linear parameter vector
while $\theta_0$ is a bias parameter (sometimes denoted $b$). We have also
introduced binary switches $s_1, \ldots, s_n$ which can only be 0 or 1. These are
structural parameters and will either completely turn off a feature $X_i$
if $s_i = 0$ or leave it on if $s_i = 1$. Trying to find the optimal selection
of features in a brute force way would mean exploring all $2^n$ configurations
of the discrete switch variables. In the MED formulation, we
can instead consider a distribution over switches, making the computation
tractable. The discrete nature of the switches does not violate
the MED formulation [80]. The partition function and the expectations
over discriminant functions also involve summations over the $s_i$ as well
as integration over the continuous parameters. The MED solution distribution
$P(\Theta)$ is then a distribution over the linear model parameters,
the switches and the bias, i.e. $\Theta = \{\theta_0, \theta_1, \ldots, \theta_n, s_1, \ldots, s_n\}$.
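
The reason a distribution over switches avoids the $2^n$ search can be seen in a small example. Assuming independent Bernoulli priors with inclusion probability $\rho$ on the switches and a unit-variance Gaussian prior on each $\theta_i$, the sum over all switch configurations factorizes into a product of $n$ per-feature terms. In the sketch below, `a` stands for the per-feature quantities $\sum_t \lambda_t y_t X_{t,i}$; its values and the function names are hypothetical.

```python
import itertools
import math

def partition_brute_force(a, rho):
    """Sum over all 2^n switch settings; theta is integrated out analytically per setting."""
    total = 0.0
    for s in itertools.product([0, 1], repeat=len(a)):
        prior = math.prod(rho if si else 1.0 - rho for si in s)
        # For a fixed s, the Gaussian integral over theta gives exp(0.5 * sum_i s_i * a_i^2)
        total += prior * math.exp(0.5 * sum(si * ai * ai for si, ai in zip(s, a)))
    return total

def partition_factorized(a, rho):
    """Product of n per-feature terms, so the cost is linear in n."""
    return math.prod(1.0 - rho + rho * math.exp(0.5 * ai * ai) for ai in a)

a = [0.3, -1.2, 0.7, 2.0]
rho = 0.1
brute = partition_brute_force(a, rho)
fact = partition_factorized(a, rho)
print(brute, fact)
```

The brute-force version touches 16 configurations for these 4 features, while the factorized version evaluates 4 terms.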

We will now define a prior over the desired MED solution and then 
discuss how to solve for the optimal projection. The prior will reflect 
some regularization on the linear SVM parameters as well as the overall 
degree of sparsity we want to enforce. In other words, we would like to 
specify (in coarse terms) how many feature switches will be set to zero. 
One possible prior is

$$P^0(\Theta) = P^0(\theta_0)\,P^0(\theta)\prod_{i=1}^{n} P^0(s_i)$$

where $P^0(\theta_0)$ is an uninformative prior on the bias, i.e., a zero mean
Gaussian prior with infinite variance. An alternative choice for the bias
prior is a finite-variance Gaussian which will give a quadratic penalty
term on the final objective function of $-\frac{\sigma}{2}\left(\sum_t \lambda_t y_t\right)^2$ instead of the hard
equality constraint. In addition, we have $P^0(\theta) = \mathcal{N}(\theta; 0, I)$, the usual
white noise Gaussian prior for the model parameters, and a prior on the
switches given by

$$P^0(s_i) = \rho^{s_i}(1-\rho)^{1-s_i}$$

where $\rho$ controls the overall prior probability of including a feature.
Thus, the prior over each feature is merely a Bernoulli distribution.
Setting $\rho = 1$ will produce the original linear classifier problem without
feature selection. By decreasing $\rho$, more features will be removed. Given
a prior distribution over the parameters in the MED formalism and a
discriminant function, we can now readily compute the partition function
(cf. Equation 3.8). Solving the integrals and summations, we obtain the
objective function

$$J(\lambda) = \sum_t\left[\lambda_t + \log(1-\lambda_t/c)\right] - \sum_{i=1}^{n}\log\left(1-\rho+\rho\,e^{\frac{1}{2}\left[\sum_t \lambda_t y_t X_{t,i}\right]^2}\right)$$

which we maximize subject to $\sum_t \lambda_t y_t = 0$.

The above is maximized to obtain the optimal setting of our Lagrange
multipliers. Given that setting, our linear classifier becomes simply

$$\mathcal{L}(X) = \sum_i\left(\frac{\rho\sum_t \lambda_t y_t X_{t,i}}{\rho+(1-\rho)\exp\left(-\frac{1}{2}\left[\sum_t \lambda_t y_t X_{t,i}\right]^2\right)}\right) x_i + b.$$

Here $X_{t,i}$ indicates the $i$'th dimension of the $t$'th training set vector. The
bias $b$ is estimated separately either from the Kuhn-Tucker conditions
(under a finite variance bias prior), or is set to $b = \sigma\sum_t \lambda_t y_t$. The terms
in the large parentheses in the equation above are the linear coefficients
of the new model and can be denoted $C_i$.

Figure 4.5. ROC curves on the splice site problem with feature selection $\rho = 0.00001$
(solid line) and without $\rho = 0.99999$ (dashed line).
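
The effect of the switch prior on the linear coefficients can be illustrated with a small sketch. It assumes the objective has the form $\sum_t[\lambda_t + \log(1-\lambda_t/c)] - \sum_i \log(1-\rho+\rho\,e^{a_i^2/2})$ and the coefficients the form $C_i = \rho\,a_i / (\rho + (1-\rho)e^{-a_i^2/2})$, where $a_i = \sum_t \lambda_t y_t X_{t,i}$. The multipliers below are hand-picked to satisfy $\sum_t \lambda_t y_t = 0$ rather than optimized, and the data and function names are hypothetical.

```python
import numpy as np

def feature_selection_objective(lam, y, X, rho, c):
    """J(lambda) for the linear classifier with Bernoulli(rho) feature switches."""
    a = X.T @ (lam * y)                      # a_i = sum_t lam_t y_t X_{t,i}
    return (np.sum(lam + np.log(1.0 - lam / c))
            - np.sum(np.log(1.0 - rho + rho * np.exp(0.5 * a ** 2))))

def effective_coefficients(lam, y, X, rho):
    """C_i: the SVM-style coefficient a_i scaled by the posterior switch probability."""
    a = X.T @ (lam * y)
    return rho * a / (rho + (1.0 - rho) * np.exp(-0.5 * a ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
lam = np.array([0.2, 0.3, 0.1, 0.1, 0.3, 0.2])   # sum(lam * y) == 0
c = 5.0

a = X.T @ (lam * y)
C_dense = effective_coefficients(lam, y, X, rho=1.0)    # no feature selection
C_sparse = effective_coefficients(lam, y, X, rho=0.01)  # strong pruning
print(np.round(C_dense, 4))
print(np.round(C_sparse, 4))
print(feature_selection_objective(lam, y, X, rho=0.01, c=c))
```

With $\rho = 1$ the coefficients reduce to the usual $\sum_t \lambda_t y_t X_{t,i}$, and lowering $\rho$ shrinks every coefficient, the small ones most strongly.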

We tested this linear feature selection method on a DNA splice site 
classification problem, where the task is to distinguish between true 
and spurious splice sites. The examples were DNA sequences of fixed 
length (25 nucleotides), with a four bit binary encoding of {A, C, T, G},
giving 100-element vectors. The training set had 500 examples and the 
test set had 4724. Results are depicted in Figure 4.5 showing superior 
classification accuracy when feature selection is used, as opposed to no 
feature selection, which is roughly equivalent to an SVM. 

Feature selection aggressively prunes the features, driving many of 
the linear model’s coefficients to zero, leading to improved generaliza- 
tion performance as well as faster run-times. To picture the sparsity of 
the resulting model, we plot the cumulative distribution function of the 
magnitudes of the coefficients, $|C_i| < x$ as a function of $x$, for all 100 components
of the linear classification vector. Figure 4.6 shows that most
of the weights resulting from the feature selection algorithm are indeed 
small enough to be neglected. 

We can extend feature selection to kernel-based nonlinear classifiers
by mapping the feature vectors explicitly into a higher dimensional representation
(e.g. through polynomial expansions). This does not retain
the efficiency of implicit kernel mappings (and infinite kernel mappings
are infeasible), but it does give us the ability to do fine-scale feature
selection as individual components of the kernel mapping can be extinguished.
The complexity of the feature selection algorithm is linear in the
number of features so we can easily work with small expansions (such as
quadratic or cubic polynomials) by explicit mapping. The above problem
was attempted with a quadratic expansion of the 100-dimensional feature
vectors by concatenating the outer products of the original features
to form an approximately 5000-dimensional feature space. Figure 4.7
shows that feature selection is still helpful (compared to a plain linear
SVM classifier), improving performance even when we have a larger and
expanded feature space.

Figure 4.6. Cumulative distribution functions for the resulting effective linear coefficients
with feature selection (solid line) and without (dashed line).

Figure 4.7. ROC curves corresponding to a quadratic expansion of the features with
feature selection $\rho = 0.00001$ (solid line) and without $\rho = 0.99999$ (dashed line).

In another experiment we used classification feature selection to label
protein chains from the UCI repository which were valid splice sites into
one of two possible classes: intron-exon or exon-intron. These are often
also called donor and acceptor sites, respectively. The chains consist
of 60 base-pairs, again represented in binary code as: A=(1000),
C=(0100), G=(0010) and T=(0001). Uncertain base-pairs were represented
as mixed codes, for example, “A or C” would be represented as
(0.5 0.5 0 0). Thus, we have 240 scalar input dimensions and a binary
output class. We trained on 200 examples and tested on the remaining
1335 examples. Figure 4.8 shows the performance.
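
The encoding just described can be sketched in a few lines. The code assumes that an uncertain position spreads its unit mass equally over the candidate bases, which matches the “A or C” example but is an assumption for positions with three or four candidates. The function names and the input format (one string of candidate bases per position) are hypothetical.

```python
import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_position(candidates):
    """Encode one position; several candidate bases share the unit mass equally."""
    code = np.zeros(4)
    for base in candidates:
        code[BASE_INDEX[base]] = 1.0 / len(candidates)
    return code

def encode_chain(chain):
    """chain is a list of strings, one per position, e.g. ["A", "AC", "T"]."""
    return np.concatenate([encode_position(p) for p in chain])

chain = ["A", "AC", "G", "T"]
vec = encode_chain(chain)
print(vec)

# A chain of 60 positions gives a 240-dimensional input vector
print(encode_chain(["A"] * 60).shape)
```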



Figure 4.8. Varying Regularization and Feature Selection Levels for Acceptor/Donor
Protein Classification. Testing performance on unseen 1335 protein sequences. The
dashed line indicates SVM performance while the solid lines indicate varying performance
improvements due to feature selection. Optimal feature selection levels for this
problem appear to be between $\rho = 10^{-2}$ and $\rho = 10^{-3}$.

In training, linear classifiers can easily separate both classes at 100%
accuracy, but using all the features causes over-fitting. The regularization
introduced by varying $c$ does not prune away features but rather
impels the algorithm to ignore outliers. The best possible performance
(as we vary $c$) attainable by regular SVMs is around 92% on the test set.
To improve on this, the crucial observation is that not the whole length
of the protein chain is useful in determining acceptor/donor status. We
would like to ignore dimensions instead of data exemplars. Experiments
show that even a small amount of feature selection can already improve
performance significantly. Setting $\rho = 10^{-2}$ or $\rho = 10^{-3}$ yields
the best generalization accuracy of about 96%. Error is halved from the
SVM’s count of more than 100 errors to an error count of 50 with feature
selection. Figure 4.9 depicts the linear model for the SVM as well as the
pruned model for the feature selection technique.

3.2 Feature Selection in Regression 

Feature selection can also be advantageous in the regression case
where a map is learned from inputs to scalar outputs. When we suspect
that a subset of input features might turn out to be irrelevant
(especially after a kernel expansion), we can again employ an aggressive
pruning strategy by adding a “switch” ($s_i$) on the parameters. The
prior is $P^0(s_i) = \rho^{s_i}(1-\rho)^{1-s_i}$ where lower values of $\rho$ encourage greater
sparsification. This prior is in addition to the Gaussian prior on the parameters
($\theta_i$), which does not have quite the same sparsification properties.

Figure 4.9. Sparsification of the Linear Model. On the left are the parameters for
the SVM’s linear model while on the right are the parameters for the feature selection
technique’s linear model. Note the sparsification in the parameters as many are set
to 0 on the right. This pruning encourages better generalization.

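The previous derivation for feature selection can also be applied in
a regression context. The same priors are used except that the prior
over margins is swapped with the one in Equation 4.1. Also, we shall
include the estimation of the bias in this case, where we have a Gaussian
prior $P^0(b) = \mathcal{N}(0, \sigma)$. This replaces the hard constraint $\sum_t \lambda_t = \sum_t \lambda'_t$
with a soft quadratic penalty, making computations simpler. After some
straightforward algebraic manipulations, we obtain an objective function
of the form

$$\begin{aligned}
J(\lambda) ={}& \sum_t y_t(\lambda'_t-\lambda_t) - \epsilon\sum_t(\lambda_t+\lambda'_t) + \sum_t\log(\lambda_t) + \sum_t\log(\lambda'_t) \\
& - \sum_t\log\left(1-e^{-\lambda_t\epsilon}+\frac{\lambda_t}{c-\lambda_t}\right) - \sum_t\log\left(1-e^{-\lambda'_t\epsilon}+\frac{\lambda'_t}{c-\lambda'_t}\right) \\
& - \frac{\sigma}{2}\Big(\sum_t(\lambda_t-\lambda'_t)\Big)^2 - \sum_i\log\left(1-\rho+\rho\,e^{\frac{1}{2}\left[\sum_t(\lambda_t-\lambda'_t)X_{t,i}\right]^2}\right).
\end{aligned}$$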

This objective function is optimized over $(\lambda_t, \lambda'_t)$ and by concavity has a
unique maximum. The optimization over Lagrange multipliers controls
optimization of the densities of the model parameter settings $P(\Theta)$ as
well as the switch settings $P(s)$. Thus, there is a joint discriminative
optimization over feature selection and parameter settings. At the optimal
setting of the Lagrange multipliers, our resulting MED regression
function is then:

$$\mathcal{L}(X) = \sum_i\left(\frac{\rho\sum_t(\lambda'_t-\lambda_t)X_{t,i}}{\rho+(1-\rho)\exp\left(-\frac{1}{2}\left[\sum_t(\lambda_t-\lambda'_t)X_{t,i}\right]^2\right)}\right) x_i + b,$$

where the bias $b$ is given by $b = \sigma\sum_t(\lambda'_t-\lambda_t)$.

Linear Model Estimator      $\epsilon$-Sensitive Linear Loss
Least-Squares Fit           1.7584
MED $\rho = 0.99999$        1.7529
MED $\rho = 0.1$            1.6894
MED $\rho = 0.001$          1.5377
MED $\rho = 0.00001$        1.4808

Table 4.1. Prediction Test Results on Boston Housing Data. Note, due to data
rescaling, only the relative quantities here are meaningful.

Below we evaluate the feature selection based regression (or support 
feature machine) on a popular benchmark dataset, the Boston housing 
problem from the UCI repository. A total of 13 continuous features are 
given to predict a scalar output, the median value of owner-occupied 
homes in thousands of dollars. To evaluate the dataset, we used linear 
regression and second order polynomial regression by applying a kernel 
expansion to the input. The dataset is split into 481 training samples 
and 25 testing samples (as in [182]). 

Table 4.1 indicates that feature selection (decreasing $\rho$) generally improves
the discriminative power of the regression. Here the $\epsilon$-insensitive
linear loss function (typical in the SVM literature) shows improvements
with further feature selection. Just as sparseness in the number of vectors
helps generalization, sparseness in the number of features is also
advantageous. The total number of input features after expanding the
second order polynomial kernel is 104, some of them with very little
discriminative power, so pruning is beneficial.
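
The feature count of 104 is what one obtains from 13 inputs if the explicit expansion keeps the original features together with all distinct pairwise products (13 + 91). The sketch below assumes that particular form of the second order expansion, without a constant term. The function name is hypothetical.

```python
import numpy as np

def quadratic_expansion(x):
    """Original features followed by all products x_i * x_j with i <= j."""
    x = np.asarray(x, dtype=float)
    i, j = np.triu_indices(x.shape[0])
    return np.concatenate([x, x[i] * x[j]])

print(quadratic_expansion([2.0, 3.0]))            # x1, x2, x1^2, x1*x2, x2^2
print(quadratic_expansion(np.ones(13)).shape)     # 13 inputs
print(quadratic_expansion(np.ones(100)).shape)    # 100 inputs
```

Under the same assumption, a 100-dimensional input expands to 5150 features, in line with the approximately 5000-dimensional space used in the splice site experiment.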

For the three trial settings of the sparsification level prior ($\rho = 0.99999$,
$\rho = 0.001$, and $\rho = 0.00001$), we again analyze the cumulative distribution
function of the resulting linear coefficients, $|C_i| < x$ as a function of $x$,
based on the features from an explicit kernel expansion. Figure 4.10
clearly indicates that the magnitudes of the coefficients are reduced as
the sparsification prior is strengthened (i.e. as $\rho$ is decreased).

MED regression was also used to predict gene expression levels using
data from “Systematic variation in gene expression in human cancer cell
lines”, by D. Ross et al. Here log-ratios of gene expression levels were
to be predicted for a Renal Cancer cell-line from measurements of each
gene’s expression levels across different cell-lines and cancer types. Input
data forms a 67-dimensional vector while the output is a one dimensional
scalar gene expression level. Training set size was limited to 50 examples
and testing was over 3951 examples. The table below summarizes the
results. We set $\epsilon = 0.2$ and $c = 10$ for the MED approach. This indicates
that feature selection is particularly helpful in sparse training situations.

Figure 4.10. Cumulative distribution functions for the linear regression coefficients
under various levels of sparsification. Dashed line: $\rho = 0.99999$, dotted line: $\rho = 0.001$
and solid line: $\rho = 0.00001$.

Linear Model Estimator      $\epsilon$-Sensitive Linear Loss
Least-Squares Fit           3.609e+03
MED $\rho = 0.00001$        1.6734e+03

Table 4.2. Prediction Test Results on Gene Expression Level Data.

3.3 Feature Selection in Generative Models 

As mentioned earlier, the MED framework is not restricted to dis- 
criminant functions that are linear or non-probabilistic. For instance, 
we can consider the use of feature selection in a generative model-based 
classifier. One simple case is the discriminant formed from the ratio 
of two identity covariance Gaussians. The parameters $\Theta$ are $(\mu, \nu)$ for
the means of the $y = +1$ and $y = -1$ classes, respectively, and the
discriminant is $\mathcal{L}(X;\Theta) = \log\mathcal{N}(\mu, I) - \log\mathcal{N}(\nu, I) + b$. As before, we
insert switches ($s_i$ and $r_i$) to turn off certain components of each of the
Gaussians giving

$$\mathcal{L}(X;\Theta) = -\frac{1}{2}\sum_i s_i(X_i-\mu_i)^2 + \frac{1}{2}\sum_i r_i(X_i-\nu_i)^2 + b.$$

This discriminant then uses similar priors to the ones previously introduced
for feature selection in a linear classifier. It is straightforward
to integrate (and sum over discrete $s_i$ and $r_i$) with these priors (shown
below and in Equation 3.9) to get an analytic concave objective function
$J(\lambda)$:

$$P^0(\mu) = \mathcal{N}(0, I) \qquad P^0(\nu) = \mathcal{N}(0, I)$$

$$P^0(s_i) = \rho^{s_i}(1-\rho)^{1-s_i} \qquad P^0(r_i) = \rho^{r_i}(1-\rho)^{1-r_i}.$$

In short, optimizing the feature selection and means for these generative 
models jointly will produce degenerate Gaussians which are of smaller 
dimensionality than the original feature space. Such a feature selection 
process could be applied to many density models in principle but com- 
putations may require mean-field or other approximations to become 
tractable. 

4. Kernel Selection 

Feature selection is by no means the only representational aspect of 
an SVM that we may want to consider. One crucial design issue of 
nonlinear SVMs is the choice of a kernel function [108, 36]. Kernels 
are an efficient method to implement higher order mappings of the data 
prior to linear classification. These higher order mappings effectively 
are representations of the data since they induce a different notion of 
distances between points and significantly change the interaction of the 
data with the (linear) model. However, the space of possible kernel 
selections is infinite and difficult to search. 

A kernel is an efficient computation of the inner product between 
two data points after a higher order mapping. For example, in the 
original space, two vectorial data points $X_1$ and $X_2$ may undergo a
mapping via the function $\phi(X)$. The kernel $K(X_1, X_2)$ is then a scalar
function over two such vectors which efficiently computes a notion of
similarity or affinity between them, $K(X_1, X_2) = \phi(X_1)^T\phi(X_2)$. Thus,
in the new feature space implied by the mapping $\phi(\cdot)$, our original linear
discriminant would get re-written as follows:

$$\mathcal{L}(X;\Theta) = \theta^T\phi(X) + b.$$
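
The identity $K(X_1, X_2) = \phi(X_1)^T\phi(X_2)$ can be illustrated with the homogeneous quadratic kernel, chosen here only because its explicit map is easy to write down. The sketch compares the inner product of the explicit mappings with the direct kernel evaluation. The function names and the example vectors are hypothetical.

```python
import numpy as np

def phi_quadratic(x):
    """Explicit map for the homogeneous quadratic kernel: all products x_i * x_j."""
    return np.outer(x, x).ravel()

def quadratic_kernel(x1, x2):
    """K(x1, x2) = (x1 . x2)^2, computed without forming phi."""
    return float(np.dot(x1, x2)) ** 2

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([0.5, -1.0, 2.0])

explicit = float(phi_quadratic(x1) @ phi_quadratic(x2))   # works in 9 dimensions
implicit = quadratic_kernel(x1, x2)                       # works in 3 dimensions
print(explicit, implicit)
```

The explicit map grows quadratically with the input dimension, while the kernel evaluation stays linear in it, which is the efficiency the text refers to.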

However, due to uncertainty, we may wish to consider a range of possible
mappings or kernel functions, i.e., $\phi_1(\cdot), \ldots, \phi_M(\cdot)$ or equivalently,
$K_1(\cdot,\cdot), \ldots, K_M(\cdot,\cdot)$. To select between them we will again introduce a set
of binary structural variables $s_1, \ldots, s_M$ as was performed for the case of
feature selection. We also need to consider different models $\vec{\theta}_1, \ldots, \vec{\theta}_M$
since each mapping may have a different dimensionality. Here we are
using arrows on each $\vec{\theta}_m$ vector to emphasize that these are vectors and
the subscript is not indexing the dimension. The resulting discriminant
function is then:

$$\mathcal{L}(X;\Theta) = \sum_{m=1}^{M} s_m\,\vec{\theta}_m^{\,T}\phi_m(X) + b.$$

The resulting distribution over models and switches that MED recovers
factorizes as follows:

$P(\Theta, \gamma) = \prod_{m=1}^{M} P(s_m, \vec{\theta}_m) \, P(b) \prod_{t=1}^{T} P(\gamma_t).$

We recover this distribution from the prior multiplied by the
exponentiated classification constraints (after ensuring normalization via
the appropriate partition function $Z(\lambda)$) as follows:

$P(\Theta, \gamma) = \frac{1}{Z(\lambda)} \prod_m P^0(s_m) P^0(\vec{\theta}_m) \, P^0(b) \prod_t P^0(\gamma_t) \, e^{\sum_t \lambda_t [y_t \mathcal{L}(X_t; \Theta) - \gamma_t]}.$

Computing the normalizing partition function from the above is
straightforward, particularly due to the factorization of terms across the
$m = 1..M$ different models and switches. We will assume M white Gaussian
priors on the model vectors $\vec{\theta}_m$, M Bernoulli priors on the
binary switch values $s_m$ and the usual margin priors and bias prior. We
elucidate the component of the partition function that involves
integrating over the model and switch variables (the bias and margin
variables are integrated over as before):

$Z_{\theta,s}(\lambda) = \prod_{m=1}^{M} \sum_{s_m=0}^{1} P^0(s_m) \exp\left( \frac{1}{2} s_m \sum_{t,t'} \lambda_t \lambda_{t'} y_t y_{t'} K_m(X_t, X_{t'}) \right).$


The resulting overall MED objective function is then:

$J(\lambda) = \sum_t \left[ \lambda_t + \log(1 - \lambda_t/c) \right] - \frac{\sigma^2}{2} \left( \sum_t \lambda_t y_t \right)^2 - \sum_{m=1}^{M} \log\left( 1 - p + p \exp\left( \frac{1}{2} \sum_{t,t'} \lambda_t \lambda_{t'} y_t y_{t'} K_m(X_t, X_{t'}) \right) \right).$

The above is similar to the optimization of several SVMs each with its
own kernel, Gram matrix and quadratic cost function yet the quadratics
are nonlinearly mixed through the log-exp mapping (which disappears
when $p = 1$) which is guaranteed to remain convex as we optimize to
obtain the Lagrange multipliers $\lambda$. To use the model for
classification, we merely need to consider the expected value of the
discriminant function according to the final distribution $P(\Theta)$:

$\hat{y} = \mathrm{sign} \int P(\Theta) \mathcal{L}(X; \Theta) \, d\Theta = \mathrm{sign}\left( \sum_t y_t \lambda_t \sum_m \hat{s}_m K_m(X_t, X) + \hat{b} \right).$

In the above we are using an expected value of the bias,
$\hat{b} = \sigma^2 \sum_t y_t \lambda_t$, and the expected value of each
switch to construct a kernel by additively combining the individual
weighted kernels:

$\hat{s}_m = \frac{p}{p + (1 - p) \exp\left( -\frac{1}{2} \sum_{t,t'} \lambda_t \lambda_{t'} y_t y_{t'} K_m(X_t, X_{t'}) \right)}.$

Clearly, the above mixture behaves as an aggregated kernel which is
guaranteed to satisfy Mercer's theorem if the original individual kernels
in the non-negative combination themselves satisfy it. It is often the
case (particularly for small p settings) that many of the weights (i.e.,
the $\hat{s}_m$ values) will quickly vanish, permitting us to simply ignore
the contribution of many kernel functions and obtain better computational
efficiency. The above derivation extends straightforwardly to the case of
kernel combination in regression problems.
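
As a minimal sketch of how the expected switches weight the individual
kernels, the code below evaluates the switch expectation for two candidate
kernels and forms the aggregated Gram matrix. The toy data, the fixed
multipliers `lam` (which in practice would come from maximizing the MED
objective), the two kernels and the prior `p` are all illustrative
assumptions.

```python
import math


def quad_form(lam, y, gram):
    """sum over t, t' of lam_t lam_t' y_t y_t' K(X_t, X_t') for one kernel."""
    n = len(lam)
    return sum(lam[i] * lam[j] * y[i] * y[j] * gram[i][j]
               for i in range(n) for j in range(n))


def expected_switch(lam, y, gram, p):
    """Expected value of a binary switch under a Bernoulli(p) prior."""
    return p / (p + (1.0 - p) * math.exp(-0.5 * quad_form(lam, y, gram)))


def combined_gram(grams, weights):
    """Aggregated kernel: weighted sum of the individual Gram matrices."""
    n = len(grams[0])
    return [[sum(w * g[i][j] for w, g in zip(weights, grams))
             for j in range(n)] for i in range(n)]


def linear(a, b):
    return sum(u * v for u, v in zip(a, b))


def rbf(a, b, sigma=1.0):
    return math.exp(-sum((u - v) ** 2 for u, v in zip(a, b)) / (2.0 * sigma ** 2))


# Hypothetical toy problem: three points, fixed (not optimized) multipliers.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1, -1, 1]
lam = [0.5, 0.5, 0.2]

grams = [[[k(a, b) for b in X] for a in X] for k in (linear, rbf)]
s_hat = [expected_switch(lam, y, g, p=0.1) for g in grams]
K_mix = combined_gram(grams, s_hat)
```

Because each weight lies between `p` and 1, the aggregated matrix is a
non-negative combination of valid Gram matrices; setting `p = 1` recovers
the unweighted sum of kernels.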

To evaluate the kernel combination selection, we used a subset of the 
UCI Isolet data set which performs alphabet letter recognition from a 
vector of audio features. There are naturally 26 classes in this prob- 
lem yet this multi-class system can be cast as 26 binary one-versus-rest 
classification problems. Each of the binary classification problems was
trained with its own kernel estimation. The classifiers were given a
choice of polynomial kernels of multiple orders and radial basis function
kernels with a range of covariances, and each computed its own kernel
combination. The polynomial kernels considered were 1st, 2nd, 3rd and 4th
order while the RBF kernels used had standard deviations of 10, 1, 0.1,
and 0.01. We explored various levels of feature selection and
regularization parameter for the different classifiers in unison and their
total error rate was computed as the sum of all binary errors. This is naturally
more pessimistic than the regular multi-class classification rate since the 
proper classifier may still dominate the multi-class problem and gener- 
ate the correct label even though many of the other binary classifications 
were wrong. We used 200 points for training and 600 for testing.
Figure 4.11 summarizes the results. The curve where no feature selection
is present is the usual SVM case where the various kernels have been
summed with unity weight each. The figure clearly shows that increasing
the feature selection level (reducing p) prunes away more kernels
from the resulting SVM and isolates a better combination of kernels.
The error rate is reduced by over 50% for the kernel combination
estimate at the low values of p (lower values beyond p = 1e-3 did not
improve the result) when compared to the SVM with a uniform kernel
combination. The fact that lower p consistently produced better
accuracy leads us to speculate that a single kernel is performing
significantly better than the others in the combination.

Figure 4.11. Kernel selection test error rate over multiple polynomial
and RBF kernels for the Isolet dataset over varying regularization (c)
levels and multiple feature selection levels. The dashed line is a regular
SVM, the dotted line is kernel selection with p = 1e-1, the dotted-dashed
line p = 1e-2 and the solid line p = 1e-3.
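
The remark that the summed binary error is more pessimistic than the
multi-class error can be illustrated with a small sketch. The score matrix
below is invented for illustration: each row holds hypothetical
one-versus-rest discriminant values for one example, and neither the
numbers nor the helper names come from the experiment.

```python
def binary_errors(scores, labels):
    """Count wrong one-versus-rest decisions summed over all classifiers."""
    wrong = 0
    for row, label in zip(scores, labels):
        for k, value in enumerate(row):
            predicted_positive = value > 0.0
            truly_positive = (k == label)
            wrong += predicted_positive != truly_positive
    return wrong


def multiclass_errors(scores, labels):
    """Count examples whose highest-scoring classifier is not the true class."""
    wrong = 0
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=lambda k: row[k])
        wrong += predicted != label
    return wrong


# Hypothetical discriminant outputs for three examples and three classes.
scores = [
    [0.9, 0.2, -0.5],    # class 1 fires spuriously, class 0 still dominates
    [-0.3, 0.4, -0.1],   # all binary decisions correct
    [-0.2, -0.6, -0.1],  # class 2 misses, yet it is still the largest score
]
labels = [0, 1, 2]

n_binary = binary_errors(scores, labels)
n_multi = multiclass_errors(scores, labels)
```

Here two binary decisions are wrong while every multi-class label is still
recovered, which is the situation the text describes.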

5. Meta-Learning 

At this point, we consider the meta-learning setting. Here multiple
tasks and models need to be learned yet share a common underlying
representation [14, 181, 30]. For instance, we may want to learn to
regress the coordinates of facial features (eyes, nose, etc.) from mug-shot 
photographs. This involves learning multiple scalar regression models. 
However, since the input space is common to all of them, we may uncover 
invariants and noise properties within the input space more easily if we 
use all the data. Alternatively, we may have a database which has been 
heterogeneously labeled and wish to bootstrap learning from one labeling 
session to another. For instance, a text database of financial documents 
may have been labeled as stocks vs. commodities while another database 
of documents may have been labeled bullish vs. bearish. These two 
labeled databases may benefit from each other since they may share
similar relevant features and discard irrelevant features in the text (i.e.,
stop words, etc.). We will illustrate meta-learning with SVMs for binary
classification problems with feature selection (but regression or kernel
selection would be just as straightforward). For meta-learning we assume
we have M discriminant functions. Each of the corresponding M models
$\theta_1, \ldots, \theta_M$ is D-dimensional and has its own scalar bias
$b_1, \ldots, b_M$. Meanwhile, these different discriminants (classifiers)
share a common D-dimensional feature selection vector $s$. As training
data, we will have $m = 1..M$ different tasks each with $t = 1..T_m$ input
data vectors $X_{t,m}$ and binary labels $y_{t,m}$. An example of a given
discriminant function for task m is then:

$\mathcal{L}(X; s, \theta_m, b_m) = \sum_{d=1}^{D} s_d \theta_{m,d} X_d + b_m.$



Given that we have a total of $\sum_m T_m$ pairs of input vectors and
labels, the MED framework will have the following classification
constraints over $t = 1..T_m$ and $m = 1..M$ as we span all the datasets:

$\int P(\theta_1, \ldots, \theta_M, s, b_1, \ldots, b_M, \gamma) \left[ y_{t,m} \mathcal{L}(X_{t,m}; s, \theta_m, b_m) - \gamma_{t,m} \right] d\Theta \geq 0.$



These constraints will give rise to $\sum_m T_m$ non-negative Lagrange
multipliers of the form $\lambda_{t,m}$. Assuming the same Bernoulli priors
on the switch vector, white Gaussian priors on the models, Gaussian priors
on the biases and the usual margin priors, we only need to consider the
component of the partition function dealing with the models and the
switches:

$Z_{\theta,s}(\lambda) = \int_{\Theta, s} P^0(s) \prod_m P^0(\theta_m) \exp\left( \sum_{t,m} \lambda_{t,m} y_{t,m} \sum_{d=1}^{D} s_d \theta_{m,d} X_{t,m,d} \right)$

$= \prod_{d=1}^{D} \sum_{s_d=0}^{1} P^0(s_d) \exp\left( \frac{1}{2} s_d \sum_{m=1}^{M} \left[ \sum_{t=1}^{T_m} \lambda_{t,m} y_{t,m} X_{t,m,d} \right]^2 \right).$

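We obtain the following MED objective function:

$J(\lambda) = \sum_{t,m} \left[ \lambda_{t,m} + \log(1 - \lambda_{t,m}/c) \right] - \frac{\sigma^2}{2} \sum_m \left( \sum_t y_{t,m} \lambda_{t,m} \right)^2 - \sum_{d=1}^{D} \log\left( 1 - p + p \exp\left( \frac{1}{2} \sum_{m=1}^{M} \left[ \sum_t \lambda_{t,m} y_{t,m} X_{t,m,d} \right]^2 \right) \right).$

After optimizing the Lagrange multipliers, each classifier is then
computed from the sign of the expected discriminant:

$\int P(\Theta) \mathcal{L}(X; s, \theta_m, b_m) = \sum_{d=1}^{D} \hat{s}_d \hat{\theta}_{m,d} X_d + \hat{b}_m$

where we have:

$\hat{\theta}_m = \sum_t y_{t,m} \lambda_{t,m} X_{t,m}$

$\hat{s}_d = \frac{p}{p + (1 - p) \exp\left( -\frac{1}{2} \sum_{m=1}^{M} \left[ \sum_t \lambda_{t,m} y_{t,m} X_{t,m,d} \right]^2 \right)}$

$\hat{b}_m = \sigma^2 \sum_t y_{t,m} \lambda_{t,m}.$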

The above derivation can easily be extended to find a common kernel 
combination under multiple tasks instead of a linear feature selection. 
Furthermore, the regression case is straightforward and follows directly 
from the classification derivations. 
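
The way a single switch vector pools evidence across tasks can be sketched
as follows. The two toy tasks, the fixed multipliers and the prior `p` are
illustrative assumptions; in the actual method the multipliers would be
obtained by maximizing the MED objective.

```python
import math


def feature_votes(X, y, lam, d):
    """sum over t of lam_t y_t X_{t,d}: one task's weight on feature d."""
    return sum(l * label * x[d] for l, label, x in zip(lam, y, X))


def shared_switches(X_tasks, y_tasks, lam_tasks, p):
    """Expected switch per feature, pooling squared weights over all tasks."""
    n_features = len(X_tasks[0][0])
    s_hat = []
    for d in range(n_features):
        energy = sum(feature_votes(X, y, lam, d) ** 2
                     for X, y, lam in zip(X_tasks, y_tasks, lam_tasks))
        s_hat.append(p / (p + (1.0 - p) * math.exp(-0.5 * energy)))
    return s_hat


def task_weights(X, y, lam, s_hat):
    """Expected linear weights of one task after applying the shared switches."""
    return [s * feature_votes(X, y, lam, d) for d, s in enumerate(s_hat)]


# Hypothetical data: both tasks see the same inputs but different labels.
X = [[1.0, 0.1], [-1.0, 0.1], [2.0, -0.1], [-2.0, -0.1]]
y_task_a = [1, -1, 1, -1]
y_task_b = [-1, 1, -1, 1]
lam = [0.5, 0.5, 0.5, 0.5]  # fixed for illustration, not optimized

s_hat = shared_switches([X, X], [y_task_a, y_task_b], [lam, lam], p=0.01)
w_a = task_weights(X, y_task_a, lam, s_hat)
w_b = task_weights(X, y_task_b, lam, s_hat)
```

In this toy case the first feature is informative for both tasks and its
switch approaches one, whereas the second feature carries no weight in
either task and its switch stays at the prior value `p`.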

To evaluate the meta-learning method, we used the UCI Dermatol- 
ogy dataset and performed linear feature selection. This is a 6-class 
dataset which can be represented as 6 binary one-versus-many classi- 
fication problems. These 6 binary classification problems are likely to 
contain inter-dependencies that may be harnessed by meta-learning since 
they all emerge from a common underlying multi-class task problem. 
There are 33 dimensions in the data and we used 200 training exem- 
plars and 166 testing exemplars. Various settings of the regularization 
parameter c and the feature selection level p were explored. We report 
errors again as a total binary classification error which is pessimistic 
since we might get the correct multi-class label but incorrectly resolve 
a few one-versus-many binary decisions along the way. In Figure 4.12, 
we summarize the results of performing linear feature selection. We 
show only the classification error as we vary p since we optimized sepa- 
rately over c. The dashed red line depicts performance when each SVM 
for each binary classification task has its own feature selection config- 
uration (i.e., the independent learning case). Meanwhile, the solid line 
depicts the performance when each SVM has to share a common fea- 
ture selection configuration (i.e., the meta-learning case). The upper 
left corner of the plot is the performance of an SVM which arises when 
no feature selection whatsoever is used. Note that both independent 
and meta-learning coincide since there is no representational variable 
whatsoever here. It is clear that both methods improve on a regular 
SVM as feature selection is performed (up to a point). Furthermore, the
meta-learning case which ties the feature selection configuration across
all SVMs and tasks consistently does better than the independent case.
In this scenario, it should also be noted that all the SVM task-specific
models have access to the same input data (but different output labels).
Therefore, the meta-learning is improving results purely on the basis of
the inter-dependencies between the tasks and not because of the
availability of more input data from the input space. In other words, if
the exemplars in the input space were different for each SVM and for each
task in the meta-learning scenario, meta-learning would also have the
added advantage of more samples in the input space to estimate a good
representation or feature selection.

Figure 4.12. Meta-learning and feature selection classification testing
errors. Here, the multi-class classification problem is mapped into the
standard 1 versus many binary classifications which are treated as several
independent tasks in a meta-learning scenario. Varying levels of feature
selection $-\log(p)$ are shown after optimizing over the regularization
parameter c. The dashed line is the independent learning case where
feature selection is done individually for each task. Meanwhile the solid
line is the meta-learning case which ties a common feature selection
across all tasks.

6. Transduction 

In this section, we provide a maximum entropy discrimination frame- 
work for solving the missing labels or transduction problem [188, 94]. In 
many classification problems, labeled data is scarce yet unlabeled data 
may be easily available in large quantities. The MED framework can 
be easily extended to utilize the unlabeled data in forming a discrimi- 
native classifier by integrating over the unobserved labels. In previous 
work, we initially presented the MED approach for transductive
classification using primarily mean-field approximations [80]. Szummer [176]
also presents an alternative transduction approach in terms of kernel
expansions which may also be cast in an MED formalism.

In the classification setting, the exact solution of the resulting MED 
projection becomes intractable (just as in SVM based transduction). 
We first review our mean-field approximation case as a possible local 
solution [80]. A global information-projection solution is also possible 
if the prior over unobserved labels is described by a distribution that 
is conjugate (and continuous) to the original distribution over models. 
We thus also provide a transduction algorithm which computes a global 
large margin solution over both labeled and unlabeled data by forcing 
the prior to be conjugate. We subsequently discuss the use of unlabeled 
data in the regression scenario which does yield a tractable global MED 
solution. 



6.1 Transductive Classification

SVM transduction requires a search over binary labels of the unla- 
beled exemplars. The complexity of this approach grows exponentially. 
Joachims proposes using efficient heuristics which approximate this pro- 
cess yet are not guaranteed to converge to the true SVM transduction 
solution [94]. Unlike SVMs, the MED approach permits a probabilistic 
treatment of the search over labels which is somewhat similar in spirit 
to relaxation methods. The discrete search problem is embedded in a 
continuous probabilistic setting. 

First, recall that MED solves for distributions over parameters as
well as other unknown quantities by augmenting the solution space. For
example, when margins are unknown in a non-separable problem, we
introduced them into the solution as posteriors $P(\Theta, \gamma)$ (and in
the prior as well). When feature selection structure was unknown, it too
was cascaded into the final MED posterior solution as
$P(\Theta, \gamma, s)$. In the transduction case, unlabeled examples are
given where $y_t$ is unknown. Thus, we can hypothesize a prior/posterior
distribution over $y_t$ as well. This distribution would ideally take on
the form of two delta functions at $-1$ and $+1$. Thus, instead of solving
for only a distribution over say $P(\Theta, \gamma)$ we generalize to the
non-separable transductive case via $P(\Theta, \gamma, y)$ where now
projection is over a larger space.

We have the following general solution:

$P(\Theta, \gamma, y) = \frac{1}{Z(\lambda)} P^0(\Theta, \gamma, y) \, e^{\sum_t \lambda_t [y_t \mathcal{L}(X_t; \Theta) - \gamma_t]}$

where $Z(\lambda)$ is the normalization constant (partition function) and
$\lambda = \{\lambda_1, \ldots, \lambda_T\}$ is again our set of non-negative
Lagrange multipliers, one per classification constraint. The Lagrange
multipliers $\lambda$ are set by finding the unique maximum of the
objective function $J(\lambda) = -\log Z(\lambda)$.

A distribution for $y$ is required such that the partition function
$Z(\lambda)$ remains analytic. If we assume that the prior for
(continuous) unlabeled $y$ is given by the natural choice of a point-wise
delta function, i.e.
$P^0(y_t) = \frac{1}{2}\delta(y_t, 1) + \frac{1}{2}\delta(y_t, -1)$, the
integrals above become intractable. To proceed, a mean-field approximation
is performed which effectively computes the integral over $P(\Theta)$ with
the unlabeled $P(y)$ locked at a current estimate and then computes an
update on the $P(y)$ while the $P(\Theta)$ is held fixed. This is
equivalent to assuming that the distribution $P(\Theta, \gamma, y)$ is
forced to factorize according to $P(\Theta, \gamma) P(y)$. Details are
provided in [80] and produce good generalization results when unlabeled
data is useful for a classification problem. However, the mean-field
approximation forces us to obtain a local solution which is no longer
unique. If our two-stage iterative algorithm is poorly initialized, this
may be a problem.

Here we take a different approach to the problem. Assume we have
the fully labeled case and have computed the partition function
$Z(\lambda)$ analytically. This partition function is log-convex by
definition. We can now treat $Z(\lambda)$ itself as some exponential family
distribution because it is a log-convex function of $\lambda$ for any
setting of the y-variables. The y-variables can then be seen as data under
this exponential family distribution while the $\lambda$ are its
parameters. We could then treat some of the y-variables in the expression
of $Z(\lambda)$ as unknown, multiply by a prior over them and integrate.
This would give us the partition function for the transductive (partially
labeled) case. However, we need to make sure that the prior is a conjugate
distribution in the y-variables such that the integral
$\int_y P^0(y) Z(\lambda, y)$ is analytic. For instance, if $Z(\lambda)$ is
Gaussian (or equivalently $J(\lambda)$ is quadratic, as in an SVM) we
should use a conjugate distribution $P^0(y)$ which is Gaussian to end up
with a final $Z(\lambda)$ which is still log-convex and analytic.

Assume that the data set is partitioned into two sets, the labeled
and unlabeled data. There are $T_l$ labeled components and $T_u$ unlabeled
components. Thus, we can consider our $y$ vector of labels and our
$\lambda$ vector of Lagrange multipliers as being split as follows (the
subscripts $l$ and $u$ mark the labeled and unlabeled blocks):

$y = [\, y_l^T \;\; y_u^T \,]^T, \qquad \lambda = [\, \lambda_l^T \;\; \lambda_u^T \,]^T.$

The $y_l$ labels are known; however, we do not know the $y_u$ labels and
only have a prior distribution over them. This prior is a scaled zero-mean
spherical Gaussian, $P^0(y_u) = N(0, \kappa^{-2} I)$. This can also be
interpreted as a prior over each individual unlabeled data point as
$P^0(y_u) = \prod_t P^0(y_t) = \prod_t N(0, \kappa^{-2})$. We can also
consider the $y$ and $\lambda$ vectors in a diagonal matrix form, as
$Y = \mathrm{diag}(y)$ and $\Lambda = \mathrm{diag}(\lambda)$ respectively:

$Y = \begin{bmatrix} Y_l & 0 \\ 0 & Y_u \end{bmatrix}, \qquad \Lambda = \begin{bmatrix} \Lambda_l & 0 \\ 0 & \Lambda_u \end{bmatrix}.$

Similarly, we can consider the data matrix $X$ as being a matrix of the
$X_t$ data vectors arranged as columns. It can be further divided into
labeled and unlabeled vectors as follows:

$X = [\, X_l \;\; X_u \,].$

For simplicity, we will derive the transduction assuming a linear classifier
yet drop the bias term (i.e. $b = 0$). This will limit the linear decision
boundaries that can be generated to those that intersect the origin. This
restriction can be circumvented by concatenating a constant scalar to our
input features $X$. However, the formulation we will show here readily
admits kernels and can also be augmented with the bias term with a
little extra derivation. Thus, our classifier's discriminant function is:

$\mathcal{L}(X; \Theta) = \theta^T X.$



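Let us derive the corresponding partition function (up to a constant
scalar factor):

$Z(\lambda) = \int_y \int_\theta \int_\gamma P^0(\Theta, \gamma, y) \, e^{\sum_t \lambda_t [y_t \mathcal{L}(X_t; \theta) - \gamma_t]}$

$Z(\lambda) = \int_y \int_\theta P^0(\theta) P^0(y) \, e^{\sum_t \lambda_t y_t \theta^T X_t} \times \int_\gamma P^0(\gamma) \, e^{-\sum_t \lambda_t \gamma_t}$

$Z(\lambda) = Z_\theta(\lambda) \times Z_\gamma(\lambda).$

We may also consider dealing directly with $J(\lambda) = -\log Z(\lambda)$,
in which case we have the following decomposition of our objective
function:

$J(\lambda) = J_\gamma(\lambda) + J_\theta(\lambda).$

Solving, we ultimately obtain the following standard $J_\gamma(\lambda)$:

$J_\gamma(\lambda) = \sum_t \lambda_t + \sum_t \log(1 - \lambda_t/c).$

The remaining component of the objective, arising from integrating over
the model distribution, is then (with subscripts $l$ and $u$ denoting the
labeled and unlabeled blocks):

$J_\theta(\lambda) = \frac{1}{2} \log \left| \kappa^2 I - (X_u \Lambda_u)^T (X_u \Lambda_u) \right| - \frac{1}{2} \lambda_l^T \left[ (X_l Y_l)^T (X_l Y_l) + (X_l Y_l)^T (X_u \Lambda_u) \left( \kappa^2 I - (X_u \Lambda_u)^T (X_u \Lambda_u) \right)^{-1} (X_u \Lambda_u)^T (X_l Y_l) \right] \lambda_l.$

This provides our overall solution for the objective function. This is an
analytic concave function with a single maximum that can be uniquely
determined to obtain the answer $P(\Theta, \gamma, y)$.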

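THEOREM 4.3 Assuming a discriminant function of the form
$\mathcal{L}(X; \Theta) = \theta^T X$ and given a factorized prior over
parameters and margins of the form
$P^0(\Theta, \gamma) = P^0(\Theta) P^0(\gamma)$ where we have set
$P^0(\Theta) \sim N(0, \kappa^{-2} I)$, $P^0(y) \sim N(0, I)$ and
$P^0(\gamma)$ is given by $P^0(\gamma_t)$ as in Equation 3.9, then
the MED Lagrange multipliers $\lambda$ are obtained by maximizing
$J(\lambda)$ subject to $0 \leq \lambda_t \leq c$:

$J(\lambda) = \sum_t \left[ \lambda_t + \log(1 - \lambda_t/c) \right] + \frac{1}{2} \log \left| \kappa^2 I - (X_u \Lambda_u)^T (X_u \Lambda_u) \right|$

$\quad - \frac{1}{2} \lambda_l^T \left[ (X_l Y_l)^T (X_l Y_l) + (X_l Y_l)^T (X_u \Lambda_u) \left( \kappa^2 I - (X_u \Lambda_u)^T (X_u \Lambda_u) \right)^{-1} (X_u \Lambda_u)^T (X_l Y_l) \right] \lambda_l,$

where the subscripts $l$ and $u$ denote the labeled and unlabeled blocks.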

It is interesting to note that in the case of no missing data, the above
objective function simplifies back to the regular fully-labeled SVM case.
The above objective function can be maximized in an axis-parallel manner
(similar to Platt's SMO algorithm): we modify a single $\lambda_t$ value
at a time. Since there are no joint constraints on the $\lambda$ vector,
this step-wise optimization is feasible without exiting the convex hull
of constraints. It is also important to use various matrix identities
(i.e. some by Kailath [100] and some matrix partitioning techniques [116])
to make the optimization efficient. This optimization gives us the
desired $\lambda$ values to specify the distribution
$P(\Theta, \gamma, y)$. We derived an efficient update rule for any chosen
$\lambda_t$ that will guarantee improvement of the objective function. The
update will be different depending on the type of $\lambda_t$ we choose,
namely whether it is labeled or unlabeled. It is also particularly
efficient to iterate within the set of labeled and then the set of
unlabeled Lagrange multipliers individually. This is because we store
running versions of the large unlabeled data matrix, $G$, and its inverse,
where:

$G = \kappa^2 I - (X_u \Lambda_u)^T (X_u \Lambda_u).$
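
The axis-parallel idea can be sketched on the simpler fully labeled
objective $J(\lambda) = \sum_t [\lambda_t + \log(1 - \lambda_t/c)] -
\frac{1}{2} \lambda^T Q \lambda$ with $Q_{tt'} = y_t y_{t'} X_t^T X_{t'}$
and no bias term. This is not the efficient update rule referred to above:
the sketch assumes a plain bisection on each partial derivative, omits the
unlabeled terms and the cached matrix $G$, and uses invented toy data.

```python
import math


def objective(lam, Q, c):
    """J(lam) = sum_t [lam_t + log(1 - lam_t / c)] - 0.5 * lam^T Q lam."""
    n = len(lam)
    barrier = sum(l + math.log(1.0 - l / c) for l in lam)
    quad = sum(lam[i] * Q[i][j] * lam[j] for i in range(n) for j in range(n))
    return barrier - 0.5 * quad


def best_coordinate(lam, Q, c, t, iters=80):
    """Maximize J along axis t by bisecting on its decreasing derivative."""
    rest = sum(Q[t][j] * lam[j] for j in range(len(lam)) if j != t)

    def slope(v):
        return 1.0 - 1.0 / (c - v) - Q[t][t] * v - rest

    if slope(0.0) <= 0.0:
        return 0.0  # the constraint lam_t >= 0 is active
    lo, hi = 0.0, c * (1.0 - 1e-12)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if slope(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return lo


# Hypothetical separable toy problem.
X = [[1.0, 1.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, -0.5]]
y = [1, 1, -1, -1]
c = 10.0
n = len(X)
Q = [[y[i] * y[j] * sum(a * b for a, b in zip(X[i], X[j]))
      for j in range(n)] for i in range(n)]

lam = [0.0] * n
history = [objective(lam, Q, c)]
for sweep in range(20):
    for t in range(n):
        lam[t] = best_coordinate(lam, Q, c, t)  # one multiplier at a time
    history.append(objective(lam, Q, c))
```

Each single-coordinate step maximizes a concave one-dimensional function
inside the box $[0, c)$, so the recorded objective values never decrease
from one sweep to the next.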

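There are a number of ways that we can now use the current setting of
the Lagrange multipliers to compute the labels of the unlabeled data.
One way is to find the distribution over $\theta$, i.e. $P(\theta)$, for
the current setting of $\lambda$ and integrate over it to obtain the
classification. We now derive the computation of $P(\theta)$:

$P(\theta) = \int_y \int_\gamma \frac{1}{Z(\lambda)} P^0(\Theta, \gamma, y) \, e^{\sum_t \lambda_t [y_t \mathcal{L}(X_t | \theta) - \gamma_t]}$

$P(\theta) \propto \exp\left\{ -\frac{1}{2} \theta^T \left( I - \frac{1}{\kappa^2} (X_u \Lambda_u)(X_u \Lambda_u)^T \right) \theta + \sum_{t \in \mathrm{labeled}} \lambda_t y_t X_t^T \theta \right\}.$

Thus, we have $P(\theta) \sim N(\mu_\theta, \Sigma_\theta)$. To obtain the
classifier, we merely need the mean of the resulting Gaussian distribution
over $\theta$. Since $\theta \sim N(\mu_\theta, \Sigma_\theta)$, we have
the following for our classifier:

$y_{new} = \mathrm{sign}\left( \int_\theta P(\theta) \mathcal{L}(X_{new} | \theta) \right) = \mathrm{sign}\left( \mu_\theta^T X_{new} \right).$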



More specifically, the mean $\mu_\theta$ is given by the formula below (and
the simplifications that follow):

$\mu_\theta = \sum_t \lambda_t y_t \left( I - \frac{1}{\kappa^2} (X_u \Lambda_u)(X_u \Lambda_u)^T \right)^{-1} X_t$

$\mu_\theta^T X_{new} = \sum_t \lambda_t y_t X_t^T X_{new} + \sum_{t'} m_{t'} X_{t'}^T X_{new}$

where the first sum runs over the labeled exemplars, the second over the
unlabeled exemplars, and we have the following definition:
$m = \kappa^2 y_l^T (\Lambda_l X_l^T X_u \Lambda_u G^{-1} \Lambda_u)$. This
vector effectively defines the linear decision boundary. It is important
to choose $\kappa$ large enough such that the unlabeled data influences the
estimation strongly (small values of $\kappa$ will cause vanishing
unlabeled Lagrange multipliers, lock the unlabeled label estimates to 0 and
effectively reduce to a standard SVM). Since all input vectors appear only
within inner product computations, the formulation can readily accommodate
kernels as well.
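
For a two-dimensional input space the Gaussian mean can be computed in
closed form, which gives a compact sketch of the resulting classifier. The
sketch assumes the reading $P(\theta) \propto \exp\{-\frac{1}{2} \theta^T
(I - \kappa^{-2} \sum_u \lambda_u^2 X_u X_u^T) \theta + \sum_l \lambda_l
y_l X_l^T \theta\}$, with sums over unlabeled ($u$) and labeled ($l$)
exemplars; the data, the multipliers and the value of `kappa` are invented
for illustration and are not optimized.

```python
def posterior_mean_2d(X_lab, y_lab, lam_lab, X_unl, lam_unl, kappa):
    """Mean of the Gaussian over theta for 2-D inputs (closed-form 2x2 inverse)."""
    # Precision matrix [[a, b], [b, d]] = I - kappa^-2 * sum_u lam_u^2 x_u x_u^T.
    a, b, d = 1.0, 0.0, 1.0
    for x, lam in zip(X_unl, lam_unl):
        w = (lam / kappa) ** 2
        a -= w * x[0] * x[0]
        b -= w * x[0] * x[1]
        d -= w * x[1] * x[1]
    det = a * d - b * b
    if a <= 0.0 or det <= 0.0:
        raise ValueError("precision is not positive definite; increase kappa")
    # Linear term r = sum_l lam_l y_l x_l from the labeled exemplars.
    r0 = sum(lam * y * x[0] for lam, y, x in zip(lam_lab, y_lab, X_lab))
    r1 = sum(lam * y * x[1] for lam, y, x in zip(lam_lab, y_lab, X_lab))
    # mu = precision^-1 r.
    return [(d * r0 - b * r1) / det, (a * r1 - b * r0) / det]


def classify(mu, x_new):
    """Sign of the expected discriminant mu^T x_new."""
    value = mu[0] * x_new[0] + mu[1] * x_new[1]
    return 1 if value >= 0.0 else -1


# Hypothetical labeled and unlabeled exemplars with fixed multipliers.
X_lab = [[1.0, 1.0], [-1.0, -1.0]]
y_lab = [1, -1]
lam_lab = [0.5, 0.5]
X_unl = [[2.0, 0.0]]
lam_unl = [1.0]

mu_labeled_only = posterior_mean_2d(X_lab, y_lab, lam_lab, [], [], kappa=4.0)
mu_transductive = posterior_mean_2d(X_lab, y_lab, lam_lab, X_unl, lam_unl, kappa=4.0)
label = classify(mu_transductive, [0.5, -0.2])
```

With no unlabeled exemplars the mean reduces to the familiar SVM-style
expansion $\sum_t \lambda_t y_t X_t$; an unlabeled point with a non-zero
multiplier stretches the mean along its own direction.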



6.2 Transductive Regression

The previous assumption of a Gaussian over unlabeled data is actually 
much more reasonable for the regression case. In the previous section, 
we showed how unlabeled exemplars in a binary (±1) classification prob- 
lem can be dealt with by integrating over them with a Gaussian prior. 
However, the Gaussian is a continuous distribution and does not match 
the discrete nature of the classification labels. In regression, the outputs 
are scalars and are therefore much better suited to a continuous Gaus- 
sian prior assumption. If the scalar outputs do not obey a Gaussian 
distribution, we may consider transforming them (via their respective
cumulative distribution functions) such that they are Gaussian. We may
also consider using other continuous distributions as priors. However,
the Gaussian has advantages since it has the conjugate form necessary 
to integrate over an MED linear regression problem (which results in 
a quadratic log-partition function in the non-transductive case). This 
guarantees that we will maintain a closed-form partition function in the 
transductive case. 

Why would we wish to use unlabeled data in a regression scenario
and when is it advantageous? The basic motivation is that transductive
regression should focus the model such that its predictions on unlabeled
data are distributed similarly to its predictions on the labeled data. In 
other words, when we extrapolate to new test regions in the unlabeled 
data, the regression function does not diverge and exhibit unusual be- 
havior. It should produce outputs that are similar to those it generated 
over the labeled data. This is illustrated in Figure 4.13, where we fit a
noisy sinusoidal data set with a high-order polynomial function. In the
standard regression scenario in Figure 4.13(a), fitting a polynomial to
the sin(x) function without transduction generates a good regression on
the labeled data (as dots) yet it sharply diverges on the unlabeled data
(as circles) and produces predictions that are far from the typical range
of [-1, 1]. If we instead require
that the outputs on the unlabeled data obey a similar distribution, these 
will probably stay within [—1,1], generate similarly distributed output, 
and produce a better regression fit. This is illustrated in Figure 4.13(b) 
where the polynomial fit must obey the output distribution even when we 
extrapolate to the unlabeled data (at x > 10). It is important, however, 
not to go too far and have the regression function follow the prior on the 
unlabeled data too closely and compromise labeled data fitting as well 
as the natural regularization properties on the parameters. Therefore, 
as usual in MED with a multi-variable prior distribution, it is important
to balance between the different priors. 




Figure 4.13. Transductive Regression versus Labeled Regression Illustration. Here,
we are attempting to fit the sin(x) function with a high order polynomial. Labeled
data is shown as dots while unlabeled data are shown as circles on the x-axis. In (a)
the unlabeled data are not used and we merely fit a regression function (solid line)
which unfortunately diverges sharply away from the desired function when it is over
the unlabeled data. In (b), the polynomial must maintain a similar distribution of
outputs (roughly within [-1,1]) over the unlabeled exemplars and therefore produces
a more reasonable regression function.

We begin with a regular (non-transductive) regression. A support
vector machine is typically cast as a large-margin regression problem
using a linear discrimination function and an epsilon-tube of
insensitivity with linear loss. Given input data as high dimensional
vectors $X_1, \ldots, X_T$ and corresponding scalar labels
$y_1, \ldots, y_T$ we wish to find a linear regressor that lies within
$\epsilon$ of each output. The regressor is again the same discriminant
function, $\mathcal{L}(X; \Theta) = \theta^T X + b$. Recall the objective
function for the regression case in Theorem 4.2. If we assume, instead
of a non-informative prior on $P^0(b)$, a zero-mean Gaussian prior with
covariance $\sigma$, we obtain the following slightly modified objective
function (which must be optimized subject to $0 \leq \lambda_t \leq c$ and
$0 \leq \lambda'_t \leq c$):

$J(\lambda) = \sum_t y_t (\lambda'_t - \lambda_t) - \epsilon \sum_t (\lambda_t + \lambda'_t) + \sum_t \log(\lambda_t) + \sum_t \log(\lambda'_t)$

$\quad - \sum_t \log\left( 1 - e^{-\lambda_t \epsilon} + \frac{\lambda_t}{c - \lambda_t} \right) - \sum_t \log\left( 1 - e^{-\lambda'_t \epsilon} + \frac{\lambda'_t}{c - \lambda'_t} \right)$

$\quad - \frac{\sigma}{2} \left( \sum_t \lambda'_t - \lambda_t \right)^2 - \frac{1}{2} \sum_{t,t'} (\lambda_t - \lambda'_t)(\lambda_{t'} - \lambda'_{t'}) (X_t^T X_{t'}).$

In the case of unlabeled data, we do not know some particular $y_t$ values
and must introduce a prior over these to integrate them out and obtain
the partition function. The prior we shall use over the unobserved $y_t$
is a white noise Gaussian prior. This modifies the above optimization
function as follows. Observe the component of $J(\lambda)$ that depends on
a given $y_t$:

$J(\lambda) = \ldots + y_t (\lambda'_t - \lambda_t) + \ldots$

Going back to the partition-function representation of that component
we have:

$Z(\lambda) = \ldots \times \exp\left( -y_t (\lambda'_t - \lambda_t) \right) \times \ldots$

If the $y_t$ value of the above is unknown, we can integrate over it with a
Gaussian distribution as a prior, i.e. $P^0(y_t) \sim N(0, 1)$. Another
possible choice is a uniform prior for $P^0(y_t)$. The Gaussian prior gives
rise to the following computation:

$Z(\lambda) = \ldots \times \int_{y_t} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} y_t^2 \right) \exp\left( -y_t (\lambda'_t - \lambda_t) \right) dy_t \times \ldots$
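
The Gaussian integral over an unlabeled $y_t$ has the closed form
$\exp(\frac{1}{2}(\lambda'_t - \lambda_t)^2)$, so the corresponding
contribution to $J(\lambda) = -\log Z(\lambda)$ is
$-\frac{1}{2}(\lambda'_t - \lambda_t)^2$. The sketch below checks this
numerically with a simple trapezoid rule; the value chosen for
$\lambda'_t - \lambda_t$ and the integration grid are arbitrary
illustrative choices.

```python
import math


def gaussian_expectation(a, half_width=12.0, steps=20000):
    """Trapezoid estimate of the integral of N(y; 0, 1) * exp(-a * y) over y."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        y = -half_width + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * math.exp(-0.5 * y * y - a * y)
    return total * h / math.sqrt(2.0 * math.pi)


a = 0.7  # hypothetical value of lam'_t - lam_t
numeric = gaussian_expectation(a)
closed_form = math.exp(0.5 * a * a)
objective_term = -math.log(numeric)  # contribution of this y_t to J
```

The negative sign of `objective_term` is what keeps the transductive
objective concave in the multipliers of the unlabeled exemplars.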

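Ultimately our updated transduction $J(\lambda)$ function is modified as
follows for the unlabeled data exemplars:

$J(\lambda) = \ldots - \frac{1}{2} (\lambda'_t - \lambda_t)^2 + \ldots$

Therefore, for the transductive regression case, we obtain the following
final objective function:

$J(\lambda) = \sum_{t \in \mathrm{labeled}} y_t (\lambda'_t - \lambda_t) - \frac{1}{2} \sum_{t \in \mathrm{unlabeled}} (\lambda'_t - \lambda_t)^2$

$\quad + \sum_t \log(\lambda_t) + \sum_t \log(\lambda'_t) - \epsilon \sum_t (\lambda_t + \lambda'_t)$

$\quad - \sum_t \log\left( 1 - e^{-\lambda_t \epsilon} + \frac{\lambda_t}{c - \lambda_t} \right) - \sum_t \log\left( 1 - e^{-\lambda'_t \epsilon} + \frac{\lambda'_t}{c - \lambda'_t} \right)$

$\quad - \frac{\sigma}{2} \left( \sum_t \lambda'_t - \lambda_t \right)^2 - \frac{1}{2} \sum_{t,t'} (\lambda_t - \lambda'_t)(\lambda_{t'} - \lambda'_{t'}) (X_t^T X_{t'}).$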



The final $P(\Theta)$ computation is straightforward to find given the
maximizer $\lambda^*$ of $J(\lambda)$. This effectively generates a simple
linear regression model which takes into account the unlabeled data. In
practice, the $y_t$ values don't have a white Gaussian distribution so we
pre-process these by transforming them into a white Gaussian (via standard
histogram fitting techniques or just a whitening affine correction). We
then solve the MED regression. The transformation is finally inverted to
obtain $y_t$ values appropriate for the original problem.
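
The simpler of the two pre-processing options, a whitening affine
correction, can be sketched as follows. The target values are invented, and
the histogram-fitting alternative mentioned above is not shown.

```python
import statistics


def fit_whitener(values):
    """Estimate the mean and standard deviation used for the affine map."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0.0:
        raise ValueError("constant targets cannot be whitened")
    return mean, std


def whiten(values, mean, std):
    """Map targets to zero mean and unit variance."""
    return [(v - mean) / std for v in values]


def unwhiten(values, mean, std):
    """Invert the affine map to return to the original output scale."""
    return [v * std + mean for v in values]


# Hypothetical labeled regression targets.
targets = [12.0, 15.5, 9.0, 20.0, 13.5]
mean, std = fit_whitener(targets)
white = whiten(targets, mean, std)      # what the MED regression would see
restored = unwhiten(white, mean, std)   # applied to the model's predictions
```

The same `mean` and `std` estimated on the labeled targets would be reused
to map the model's predictions on unlabeled inputs back to the original
scale.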

Figure 4.14 depicts results on the Ailerons data set (by R. Camacho),
which addresses a control problem for flying an F16 aircraft.
The inputs are 40 continuous attributes that describe the status of the
airplane (i.e. pitch, roll, climb-rate) while the output is the control
action for the ailerons of the F16. An implicit second-order polynomial
(quadratic) kernel was used as a regression model. For the labeled case,
we trained on 96 labeled data points (using standard SVM regression).
The MED transductive regression case used 96 labeled and 904 unlabeled
examples for training. Figure 4.14 depicts better regression accuracy for
transduction techniques at appropriate levels of regularization (while
the non-transductive regression remains somewhat fixed despite varying
regularization levels).

It appears that the transduction is mostly useful when the labeled data
is ambiguous and could cause large errors when we extrapolate too far
away from it to distant points in our unlabeled test data. The Gaussian
prior on unobserved variables effectively constrains the extrapolation
caused by over-fitting and prevents unlabeled examples from generating
extreme regression outputs. If the unlabeled examples are, however, in
the convex hull of the labeled ones, transductive regression is unlikely
to be beneficial.

Figure 4.14. Transductive Regression versus Labeled Regression for Flight Control.
The plot shows the inverse RMS error for the labeled regression case (dashed line)
and the transductive regression case (solid line) at varying c-regularization levels.



7. Other Extensions 

In this section we will motivate some extensions at a cursory level
for completeness. More thorough derivations and results concerning
anomaly detection, latent anomaly detection, tree structure learning,
invariants, and theoretical concepts can be found in [80, 86] and [121].
The MED framework is not limited to learning continuous model
parameters like Gaussian means and covariances. It can also be used
to learn discrete structures. For instance, one may consider using
MED to learn both the parameters and the structure of a graphical
model. In that case, $\Theta$ may be partitioned into a component that
learns the discrete independency structure of the graphical model and a
component that learns the continuous parameters of the probability tables.
Since MED solves for a continuous distribution over the discrete
and continuous model components, its estimation will remain
straightforward.




Figure 4.15. Tree Structure Estimation.

For example, consider solving for tree structures where a classifier
results from the likelihood ratio of two tree distributions. We have a
space of dimensionality $D$ and therefore $D$ nodes in each tree to connect.
In Figure 4.15 we show an example where 5 nodes are to be connected in
different tree structures. One configuration is on the left while the other
is on the right. The resulting discriminant function has the abstract
form:



$$\mathcal{L}(X;\Theta) = \log \frac{P(X|\theta_+, E_+)}{P(X|\theta_-, E_-)} + b.$$



Here, the model description $\Theta$ is composed of both a set of
continuous parameters $\theta_\pm$ for each tree and structure components $E_\pm$
which specify the configuration of edges present between
the various nodes. The classification constraints then involve not
only an integration but also a summation over discrete structures:




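$$\sum_{E_+}\sum_{E_-}\int P(\Theta,\gamma)\,\left[\,y_t\,\mathcal{L}(X_t;\Theta) - \gamma_t\,\right]\, d\theta_+\, d\theta_-\, d\gamma_t\, db \;\geq\; 0 \qquad \forall t.$$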



Similarly, computation of the partition function $Z(\lambda)$ requires
integrating the exponentiated constraints multiplied by the prior
distribution $P^0(\theta_+, \theta_-, E_+, E_-, \gamma, b)$. Since there is an exponential number
of tree structures that could connect $D$ nodes, explicitly summing over all $E_+$ and
$E_-$ would be intractable. However, due to a classical result in
graph theory (namely the matrix tree theorem), summing over all possible
tree structures of a graph can be done efficiently. This is reminiscent
of Section 3 where we discussed an alternative form of structure learning.
There, we also solved for a discrete component of the model, namely feature
selection. We similarly had to sum over an exponential number of
feature selection configurations. However, in this problem and the earlier
one, embedding the computation into a probabilistic MED setting
makes it solvable in an efficient way. Further details on tree structure
estimation are omitted here but are provided in [80] and [121].
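
The role of the matrix tree theorem can be illustrated numerically. The following is a minimal sketch and not the estimation procedure of [80] or [121]: it assumes arbitrary positive edge weights `W` (hypothetical values drawn at random) and checks that the sum, over all spanning trees, of the product of edge weights equals a determinant of the reduced graph Laplacian. The function names are our own.

```python
import itertools
import numpy as np

def tree_sum_determinant(W):
    """Sum over spanning trees of the product of edge weights (matrix tree theorem)."""
    L = np.diag(W.sum(axis=1)) - W      # weighted graph Laplacian
    return np.linalg.det(L[1:, 1:])     # delete one row and column, take the determinant

def is_spanning_tree(edge_subset, D):
    """D-1 edges form a spanning tree iff they contain no cycle (union-find check)."""
    parent = list(range(D))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for u, v in edge_subset:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False
        parent[ru] = rv
    return True

def tree_sum_brute_force(W):
    """Same quantity by enumerating every set of D-1 edges that forms a tree."""
    D = W.shape[0]
    edges = list(itertools.combinations(range(D), 2))
    total = 0.0
    for subset in itertools.combinations(edges, D - 1):
        if is_spanning_tree(subset, D):
            total += np.prod([W[u, v] for u, v in subset])
    return total

D = 5                                    # five nodes, as in Figure 4.15
rng = np.random.default_rng(0)
W = np.triu(rng.uniform(0.5, 2.0, size=(D, D)), 1)
W = W + W.T                              # symmetric weights, zero diagonal

fast = tree_sum_determinant(W)           # cubic cost in D
slow = tree_sum_brute_force(W)           # enumerates all candidate edge sets
unit_count = tree_sum_determinant(np.ones((D, D)) - np.eye(D))  # number of trees on 5 nodes
```

With unit weights the determinant simply counts the spanning trees of the complete graph, which grows as $D^{D-2}$, while the determinant itself costs only on the order of $D^3$ operations.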




Chapter 5 



LATENT DISCRIMINATION 



Entities should not be multiplied unnecessarily. 

William of Ockham, 1280-1349 



We have discussed the maximum entropy discrimination framework 
for optimizing the discriminative power of generative models. It maxi- 
mizes accuracy on the given task through large margins just as a sup- 
port vector machine optimizes the margin of linear decision boundaries 
in Hilbert space. The MED framework is straightforward to solve for ex- 
ponential family distributions and, in the special case of Gaussian means, 
subsumes the support vector machine. We also noted other useful models 
it can handle, such as arbitrary-covariance Gaussians for classification, 
multinomials for classification and general exponential family distribu- 
tions. Nevertheless, despite the generality of the exponential family, we 
are still restricting the potential generative distributions in MED to only 
a subclass of the popular models in the literature. To harness the power 
of generative modeling, we must go beyond such simple models and con- 
sider mixtures or latent variable models. Such interesting extensions 
include sigmoid belief networks, latent Bayesian networks, and hidden 
Markov models, which play a critical role in many applied domains. 
These models are potential clients for our MED formalism and could 
benefit strongly from a discriminative estimation technique instead of 
their traditional maximum-likelihood incarnations. 

Unfortunately, computational problems quickly arise when discriminative
estimation is directly applied to latent models like mixture models
and models with hidden variables. The latent aspects of these models
typically prevent them from being computationally tractable in both
generative and discriminative settings. Taking motivation from Ockham’s
quote above (it is unlikely Ockham was referring to computational
issues), we will use bounds on the constraints in the latent MED formulation
to work our way towards a tractable iterative solution. Otherwise,
if treated exactly and multiplied exactly, latent models would cause an
exponential explosion in the number of terms in the MED solution and in
its objective function $J(\lambda)$. Our bounds are related to a variety of methods
in the literature for iteratively handling latent problems that otherwise
become computationally intractable [40, 134, 161, 75, 88, 89, 81, 84].
They will let us iteratively constrain and solve the MED problem, grad- 
ually converging to a solution without any intractable steps along the 
way. 

We first note that it is possible to bound the intractable constraints 
in the MED latent setting so that they are replaced with simpler yet 
stricter versions. By invoking Jensen’s inequality and by considering 
additional constraint equations for each data point, the MED problem 
can be updated in closed form. To handle latent variables, we create 
many additional constraints and Lagrange multipliers. These arise be- 
cause each latent configuration in the incorrect models must be upper 
bounded by the log-likelihood of the correct latent model. We explicate 
the case of binary discrimination with two mixtures of Gaussians. For 
the case of non-informative priors, estimating a mixture of Gaussians 
actually involves solving the standard support vector machine equations 
iteratively. We show that this iterated support vector machine solution 
has a slightly modified Gram matrix. This Gram matrix is actually 
iteratively adjusted by the posterior distribution over latent variables, 
while the mixture model parameters themselves are adjusted to discrim- 
inatively maximize the classification margins. 

An iterative algorithm reminiscent of sequential minimal optimization 
(SMO) [147] is outlined that optimizes the latent model. This sequen- 
tial algorithm brings forward an important efficient implementation for 
latent MED. It allows us to handle latent models efficiently, even if they 
have exponentially many latent configurations and give rise to exponen- 
tially many MED constraints and Lagrange multipliers. This is done by 
leveraging the factorization properties of the posterior distribution on la- 
tent variables. This factorization is inherited by the Lagrange multipliers 
which are constrained to vary only as scaled versions of the posteriors on 
latent variables. This insight permits us to consider the large panorama 
of structured latent models such as hidden Markov models where hidden 
variables are not fully independent and could quickly create intractably 
large state spaces. 
The result is a discriminative variant of the expectation-maximization
algorithm, which still enjoys similar efficiencies on latent graphical mod- 
els and latent Bayesian networks, yet maintains the large margin clas- 
sification requirements that are appropriate for discriminative learning 
tasks. Even intractable latent models are potentially accessible to this 
latent MED formulation via mean-field and structured mean-field meth- 
ods [81]. 

1. Mixture Models and Latent Variables 




Figure 5.1. Graph of a Standard Mixture Model 

The most straightforward generalization beyond the class of exponen- 
tial family distributions is to consider mixtures. In Figure 5.1 we show 
such a model, where each observed datum X has a hidden latent parent 
m, which selects which emission distribution in the mixture the datum 
is sampled from. We begin by considering a simple mixture model 

$$P(X|\Theta) = \sum_m P(m)\,P(X|m,\Theta) = \sum_{j=1}^{M} \alpha_j\, P(X|\theta_j).$$

One possible parametric form of the mixture model is a mixture
of exponential family distributions

$$P(X|\Theta) = \sum_{j=1}^{M} \alpha_j \exp\left(A_j(X) + T_j(X)^{T}\theta_j - K_j(\theta_j)\right).$$

Here, we are reusing the notation and natural parameterization for the
exponential family introduced in Chapter 2. By abuse of notation, we
will treat indexes such as $j$ and random variables such as $m$ interchangeably,
although the meaning should be clear from the context. Note that
the $\alpha_j$ are scalars that sum to unity and represent the mixing proportions
$P(m)$ for each model in the mixture. For now, merely think of the
above as a standard mixture model, such as a mixture of Gaussians [16].
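
As a concrete instance of the mixture above, the following sketch evaluates $\log P(X|\Theta)$ for a mixture of identity-covariance Gaussians. It is only an illustration under assumed parameter values; the function names and the numbers are hypothetical, and the log-sum-exp arrangement is simply a numerically stable way of computing the logarithm of the sum.

```python
import numpy as np

def log_gaussian_identity(X, mean):
    """log N(X | mean, I) for a D-dimensional point X."""
    D = X.shape[-1]
    diff = X - mean
    return -0.5 * D * np.log(2.0 * np.pi) - 0.5 * np.sum(diff * diff, axis=-1)

def mixture_log_density(X, alphas, means):
    """log P(X|Theta) = log sum_j alpha_j N(X | theta_j, I), computed via log-sum-exp."""
    # log_joint[j] = log P(m = j, X | Theta) = log alpha_j + log N(X | theta_j, I)
    log_joint = np.log(alphas) + np.array(
        [log_gaussian_identity(X, mu) for mu in means])
    top = np.max(log_joint)
    return top + np.log(np.sum(np.exp(log_joint - top)))

alphas = np.array([0.3, 0.7])                    # assumed mixing proportions, sum to one
means = np.array([[0.0, 0.0], [3.0, 1.0]])       # assumed Gaussian means theta_j
x = np.array([1.0, 0.5])

log_p = mixture_log_density(x, alphas, means)
# Direct (less stable) evaluation of the same sum, for comparison.
direct = np.log(sum(a * np.exp(log_gaussian_identity(x, mu))
                    for a, mu in zip(alphas, means)))
```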

Why can’t we apply expectation maximization in a discriminative
setting? The EM algorithm is indeed the workhorse of latent variable
models and mixture models. It would be straightforward in a binary
classification problem to split the positive and negative exemplars into
two separate data sets and then use EM on each to estimate mixture
model parameters $\alpha_j$ and $\theta_j$. However, the criterion the EM algorithm
maximizes is log-likelihood, which is clearly not discriminative. Consider
using a mixture model in the binary classification problem setting depicted
in Figure 5.2. Here, we have a training data set which consists of o’s
(positive class) and x’s (negative class). These have been sampled from
eight identity-covariance Gaussians (four for each class). Each Gaussian
also had equal prior probability. We will fit this data with a two-class
generative model which incorrectly has 2 Gaussians per class (again each
Gaussian is forced to have the same mixing proportion and the same
identity covariance). Two solutions are shown: the maximum likelihood
parameter configuration in Figure 5.2(a) and a more discriminative setting
(obtained by maximizing the conditional likelihood) in Figure 5.2(b). Each of
the 4 Gaussians in the model is depicted by its iso-probability contour.
The 2 thick circles represent the positive (o’s) Gaussian models while the
2 thin circles represent the negative (x’s) Gaussian models. We also see
the values of the joint log-likelihood $l$ for each solution. Note how the
maximum likelihood configuration has a much higher likelihood value, $l$.

Figure 5.2. Generative versus Discriminative Mixture Models. Thick circles represent
Gaussians over the o’s, thin circles represent Gaussians over the x’s.

However, in practical applications, these distributions will be used to 
make inferences, for example, to induce a classification boundary. Points 
where the positive two-Gaussian model has higher likelihood than the 
negative one will be assigned to the positive class and vice-versa. In 
Figure 5.2(a) this results in a decision boundary that splits the figure 
in half with a horizontal line across the middle. This is because the 
positive Gaussians overtake the top half of the figure and the negative 
ones overtake the bottom half of the figure. Counting the number of 
correct classifications, we see that the ML solution performs as well as 
random chance, getting roughly 50% accuracy. This is because the ML 
model is trying to cluster the data and place the Gaussian models close 
to the samples that belong to their class. In fact, in ML, fitting positive
models to positive data is done independently of fitting negative
models to negative data. This is how EM iterates as it increases its
log-likelihood objective function: its objective is to obtain as good a
generator of the data as possible, so classification performance is sacrificed.

Meanwhile, in Figure 5.2(b), the decision boundary that is generated 
by the model creates 4 horizontal classification strips (as opposed to just 
splitting the figure in half). These strips emerge from the 4 vertically 
interleaved Gaussians. The regions classify the data as positive, nega- 
tive, positive and negative respectively as we go from top to bottom. 
The resulting accuracy for this fit is almost 100%. Clearly, better dis- 
crimination is possible when dealing with latent variables and MED is 
now brought to bear on this problem. 

Let us first see how far we can follow the standard recipe in Chapter 3 
to perform discriminative binary classification via the log-ratio of two 
different mixture models. The discriminant function we obtain is then 



$$\mathcal{L}(X_t;\Theta) = \log \frac{\sum_m P(m,X_t|\Theta_+)}{\sum_n P(n,X_t|\Theta_-)} + b.$$



More specifically, given our usual binary classification problem setup
involving $X_1,\ldots,X_T$ with corresponding binary labels $y_1,\ldots,y_T$, the
parameters for a mixture model can be split as $\Theta = \{\Theta_+, \Theta_-, b\}$. These
parameters correspond to the positive class model, the negative class
model and the bias, respectively. The MED approach should recover a
distribution $P(\Theta)$ over all these parameters which satisfies the required
discrimination or classification constraints.
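
For a single fixed parameter setting (rather than a distribution $P(\Theta)$), the log-ratio discriminant can be sketched as follows. The component means below are hypothetical and are chosen to be vertically interleaved, loosely in the spirit of Figure 5.2(b); the identity-covariance Gaussian components and all function names are assumptions of this illustration.

```python
import numpy as np

def log_sum_exp(v):
    """Numerically stable log of a sum of exponentials."""
    top = np.max(v)
    return top + np.log(np.sum(np.exp(v - top)))

def log_joint(X, weights, means):
    """Vector of log P(m, X | Theta) for identity-covariance Gaussian components."""
    D = X.shape[0]
    sq = np.sum((means - X) ** 2, axis=1)
    return np.log(weights) - 0.5 * D * np.log(2.0 * np.pi) - 0.5 * sq

def discriminant(X, pos, neg, b):
    """L(X; Theta) = log sum_m P(m,X|Theta+) - log sum_n P(n,X|Theta-) + b."""
    return log_sum_exp(log_joint(X, *pos)) - log_sum_exp(log_joint(X, *neg)) + b

# Assumed parameters: two components per class with equal mixing proportions.
pos = (np.array([0.5, 0.5]), np.array([[0.0, 3.0], [0.0, -1.0]]))
neg = (np.array([0.5, 0.5]), np.array([[0.0, 1.0], [0.0, -3.0]]))
b = 0.0

queries = ([0.0, 3.0], [0.0, 1.0], [0.0, -1.0], [0.0, -3.0])
labels = [int(np.sign(discriminant(np.array(x), pos, neg, b))) for x in queries]
```

Moving from top to bottom, the predicted labels alternate between the positive and the negative class, which is the strip-like decision pattern described for the discriminative solution.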

However, this approach on its own is not fruitful since the required
MED computations and integrals become intractable. Since the mixture
models within the discriminant functions are no longer within the
exponential family, the MED solution and optimization problem will
have an excessively large number of terms to evaluate. Only exponential
family distributions remain in (or collapse back into) the exponential
family when multiplied and do not give rise to an exponential number
of terms. Note the classification constraints on our latent log-likelihood
ratio discriminant function:



$$\int P(\Theta)\left[\, y_t\left(\log \frac{\sum_m P(m,X_t|\Theta_+)}{\sum_n P(n,X_t|\Theta_-)} + b\right) - \gamma_t \right] d\Theta \;\geq\; 0, \qquad t = 1..T.$$



In MED, these constraints define a convex hull or admissible set $\mathcal{P}$ of
allowed distributions $P(\Theta)$. We must explore this hull while minimizing
the Kullback-Leibler divergence to a prior distribution $P^0(\Theta)$. Here, for
simplicity, we are omitting the distribution over margins and requiring
separability with manually specified scalar-valued margins (set at, for
instance, $\gamma_t = 1$ for $t = 1..T$). All derivations can be readily extended to
handle non-separable problems by using a distribution over margins as in
Chapter 3. The MED solution $P(\Theta)$ then has the usual form: the prior
multiplied by the exponentiated constraints, which are each scaled by
Lagrange multipliers. However, when computing the partition function
$Z(\lambda)$, these exponentiated constraints turn into a product of sums, which
has an exponential number of terms. For instance, consider the latent
classification problem with $t = 1..T$ observations consisting of a positive
mixture model with $M$ components and a negative mixture model with
$N$ components. The resulting MED solution and the partition function
involve products of the mixtures from each observation yielding a total
of $M^T + N^T$ terms to describe $Z(\lambda)$ or $P(\Theta)$. This makes it intractable
to optimize the Lagrange multipliers and use the MED solution in realistic
machine learning problems. The main culprits are the classification
constraints, which yield a complicated convex hull on $P(\Theta)$ solutions.
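
The growth in the number of terms can be seen in a purely combinatorial toy example. The sketch below is only a counting illustration with hypothetical numbers `a[t, m]` standing in for one mixture term per datum; it does not evaluate the actual partition function. A product of $T$ sums of $M$ terms each is cheap in factored form, but once expanded (as is needed when each product of terms must be integrated over $\Theta$ separately) it contains one term per joint assignment of the latent variables.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
T, M = 6, 3
a = rng.uniform(0.1, 1.0, size=(T, M))   # a[t, m]: hypothetical stand-in for one mixture term

# Factored form: T sums of M terms each, O(T * M) work.
factored = np.prod(a.sum(axis=1))

# Expanded form: one product per joint assignment (m_1, ..., m_T), M**T terms in total.
assignments = list(itertools.product(range(M), repeat=T))
expanded = sum(np.prod(a[np.arange(T), list(ms)]) for ms in assignments)
n_terms = len(assignments)
```

Even at this tiny scale the expansion already has several hundred terms; at realistic values of $T$ enumeration is out of the question.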

We circumvent this problem by bounding the classification constraints
with simpler yet stricter constraints such that the integrals are solvable
without intractable computation. In other words, we would like to avoid
explicitly using classification constraints of the form
$\int P(\Theta)[y_t \mathcal{L}(X_t;\Theta) - \gamma_t]\,d\Theta \geq 0$. We instead simplify each while also further
restricting the convex hull of possible $P(\Theta)$ solutions to a subset. This restriction
converts our MED projections into fully tractable projections of exponential
family form. This will yield a tractable iterative algorithm for
the latent discrimination problem that mirrors the EM algorithm. This
iterative MED algorithm is efficient, but no longer globally optimal. EM
itself is plagued by local minima issues anyway; however, unlike EM, this
latent MED approach is task-related and estimates a discriminative
parameter setting for the mixture model.
2. Iterative MED Projection

We approach the MED solution conservatively and estimate the distribution
$P(\Theta)$ by an iterative scheme where we devise an update rule for
$P(\Theta)$ which slowly moves it from a current setting $P^{i}(\Theta)$ to an improved
one $P^{i+1}(\Theta)$. Figure 5.3 presents the general picture. Essentially, the
original admissible set of constraints $\mathcal{P}$ is avoided as we iteratively employ
smaller convex hulls of constraints that are more conservative than
$\mathcal{P}$ yet are also simple enough to avoid the aforementioned intractabilities
when we solve for the actual MED information projections.




Figure 5.3. Iterated Latent MED Projection with Stricter Constraints. Direct projection
to the admissible set $\mathcal{P}$ would give rise to an intractable MED solution due
to the mixtures in the partition function. Instead, a stricter convex hull within the
admissible set is found using the Jensen inequality on the correct class’ log-likelihood
model for each datum and then repeating the constraints to ensure that the Jensen
bound is larger than each individual latent configuration of the incorrect class’
log-likelihood model. These simpler constraints are all now expectations of exponential
family forms, which gives rise to a closed form MED projection. The process is iterated
until we converge to a locally optimal point that is as close to the prior as
possible while remaining in the admissible set.

We select the more restrictive convex hull of constraints based on our
current estimate of $P^{i}(\Theta)$. The MED projection from the prior to this
new convex hull of constraints is then $P^{i+1}(\Theta)$, which is used to seed the
next iteration. Since the convex hull of constraints for $P^{i}(\Theta)$ is stricter
than the original true admissible set and its intractable constraints, this
new $P^{i+1}(\Theta)$ solution distribution is closer to the prior yet still lives
within the original convex hull of constraints $\mathcal{P}$. We then find a new
stricter convex hull of constraints that is adjusted to the most recently
estimated $P^{i+1}(\Theta)$ such that, loosely speaking, it is centered around the
current solution distribution, and repeat the process iteratively. Each
projection we solve for should improve the convex hull and bring us
closer to the prior. This will iteratively move our solution distribution
closer to the prior $P^0(\Theta)$ while never leaving the original convex hull of
constraints in the original intractable problem.



3. Bounding the Latent MED Constraints 

The above scheme hinges on our ability to find a stricter convex hull
of constraints that stays within the original convex hull $\mathcal{P}$. Furthermore,
this stricter convex hull should have a simpler form so that the MED
information projection reduces to a tractable exponential family type
of solution. We would like this convex hull of constraints to agree with
our current setting of $P^{i}(\Theta)$, so that we iteratively progress towards a
configuration of $P^{i+1}(\Theta)$ that is closer to the prior distribution $P^0(\Theta)$.
We assume we know $P^{i}(\Theta)$ or have initialized it with some initial guess
configuration. In fact, the initial guess need not actually live within the
convex hull of constraints as long as our updated versions do. Looking
more closely at our classification constraints, we can see that each
constraint defines a half-plane in the space of $P^{i+1}(\Theta)$, constraining the
choice of valid distributions in our MED or minimum relative entropy
optimization problem. For instance, consider a single constraint that
emerges from a given input datum $X_t$ and a corresponding label $y_t$
which, for now, we assume (without loss of generality) happens to be
positive, in other words $y_t = +1$. This gives rise to the classification
constraint




$$\int P(\Theta)\left[\log \frac{\sum_m P(m,X_t|\Theta_+)}{\sum_n P(n,X_t|\Theta_-)} + b - \gamma_t\right] d\Theta \;\geq\; 0.$$



Clearly, this defines a half-plane constraining the legitimate choices for
the possible solution distribution $P^{i+1}(\Theta)$ or, $P(\Theta)$ for short. One way
out is to make this constraint (and others) stricter, guaranteeing that
if $P(\Theta)$ satisfies the stricter constraints, it will still satisfy the original
constraints that formed the admissible set $\mathcal{P}$. For instance, we can
consider applying the Jensen inequality [93] on the positive log-sum:



$$\log \sum_m P(m,X_t|\Theta_+) \;\geq\; \sum_m Q_t(m) \log \frac{P(m,X_t|\Theta_+)}{Q_t(m)}.$$



This inequality holds for any choice of the $M$ non-negative scalar quantities
$Q_t(m)$, which are constrained to form a distribution by summing to
unity as in $\sum_m Q_t(m) = 1$. This is due to the concavity of the logarithm
function. This well-known inequality is precisely the tool underlying the
expectation maximization algorithm. Using Jensen yields the stricter
constraint:

$$\int P(\Theta)\left(\sum_m Q_t(m) \log P(m,X_t|\Theta_+) + H(Q_t) - \log \sum_n P(n,X_t|\Theta_-) + b - \gamma_t\right) d\Theta \;\geq\; 0.$$

Note that the inequality has inherited a constant additive term that
depends on the entropy $H(Q_t) = -\sum_m Q_t(m) \log Q_t(m)$ of the distribution
$Q_t$. Given a current hypothesis for $P^{i}(\Theta)$, one would select a
priori a setting for the $Q_t(m)$ which ensures that the bound is as tight
as possible for the current $P^{i}(\Theta)$. A good choice for $Q_t$ is the posterior
distribution on latent variables given the current distribution over
models



$$Q_t(m) = \frac{\exp\left(\int P^{i}(\Theta) \log P(m,X_t|\Theta_+)\, d\Theta\right)}{\sum_{m'} \exp\left(\int P^{i}(\Theta) \log P(m',X_t|\Theta_+)\, d\Theta\right)}.$$
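
The behaviour of the Jensen bound can be checked on a small numerical example. The sketch below works with a single fixed parameter setting instead of a distribution $P^{i}(\Theta)$, and the values of $\log P(m, X_t|\Theta_+)$ are hypothetical. It shows that any distribution $Q_t$ gives a lower bound on the log-sum, and that the posterior over the latent variable makes the bound tight.

```python
import numpy as np

def log_sum_exp(v):
    """Numerically stable log of a sum of exponentials."""
    top = np.max(v)
    return top + np.log(np.sum(np.exp(v - top)))

def jensen_lower_bound(log_joint, Q):
    """sum_m Q(m) log P(m, X) + H(Q), a lower bound on log sum_m P(m, X)."""
    Q = np.asarray(Q, dtype=float)
    nz = Q > 0                                   # treat 0 log 0 as 0
    entropy = -np.sum(Q[nz] * np.log(Q[nz]))
    return np.sum(Q[nz] * log_joint[nz]) + entropy

log_joint = np.log(np.array([0.20, 0.05, 0.01]))  # assumed values of log P(m, X_t | Theta+)
exact = log_sum_exp(log_joint)

uniform = np.full(3, 1.0 / 3.0)                   # an arbitrary valid choice of Q_t
posterior = np.exp(log_joint - exact)             # Q_t(m) proportional to P(m, X_t | Theta+)

bound_uniform = jensen_lower_bound(log_joint, uniform)
bound_posterior = jensen_lower_bound(log_joint, posterior)
```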

Other choices are also permitted, including forced factorizations of the
posterior via mean-field methods, when intractable models are considered
(see Section 9). Choosing the posterior distribution on latent variables
for our $Q_t$ has the approximate effect of centering the convex hull of
constraints around the current projection $P^{i}(\Theta)$, since the lower bound
is close to the left hand side when $P(\Theta) = P^{i}(\Theta)$ and then falls off
away from that configuration until the constraint is violated. One pleasant
result from applying Jensen’s inequality is that we have eliminated one
log-sum and the intractabilities it can cause. However, we still must
contend with the other log-sum involving the negative class model, $\Theta_-$.
Jensen is not helpful here since the second log-sum term is negated and
Jensen there would produce an upper bound. Instead, note the following
equality produced from the epigraph of the log-sum after some simple
algebra:



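$$\log \sum_n P(n,X_t|\Theta_-) = \sum_n \frac{P(n,X_t|\Theta_-)}{\sum_{n'} P(n',X_t|\Theta_-)} \left(\log P(n,X_t|\Theta_-) - \log \frac{P(n,X_t|\Theta_-)}{\sum_{n'} P(n',X_t|\Theta_-)}\right).$$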

For short, we can define the following variable weights:

$$\bar{Q}_t(n) = \frac{P(n,X_t|\Theta_-)}{\sum_{n'} P(n',X_t|\Theta_-)}.$$



We rewrite the above epigraph by maximizing over the weights on the
right hand side:

$$\log \sum_n P(n,X_t|\Theta_-) = \max_{\bar{Q}_t} \sum_n \bar{Q}_t(n) \log P(n,X_t|\Theta_-) + H(\bar{Q}_t).$$

Note that the maximization here explores all possible settings of $\bar{Q}_t$, a
distribution (or vector) of non-negative quantities that sums to unity. At
this point, we shall simplify the above equality and replace it with a
winner-takes-all approximation. This is a typical approximation of the
log-sum where we approximate the sum with its maximum. For instance,
k-means, a popular variant of expectation-maximization, performs such
an approximation when it mimics maximum likelihood problems. Instead
of considering the right hand side above as a maximization over
all possible continuous $\bar{Q}_t$ settings, we thus restrict the maximization to
only explore unit-vector settings for $\bar{Q}_t$, in other words $\bar{Q}_t(n) \in \{0,1\}$,
and have the following approximation to the log-sum:

$$\log \sum_n P(n,X_t|\Theta_-) \;\approx\; \max_{\bar{Q}_t \,:\, \bar{Q}_t(n) \in \{0,1\}} \sum_n \bar{Q}_t(n) \log P(n,X_t|\Theta_-).$$

This indicates that we are searching over all possible settings of the
hidden variables instead of handling their convex combination. For
most mixture model and latent model settings this is acceptable since
mixture components are typically dissimilar. Note that the entropy term
vanished since binary settings of $\bar{Q}_t$ have zero entropy. Due to the discrete
maximization, we can represent the above approximation and maximization
over binary $\bar{Q}_t$ by replicating our classification constraints $N$
times. We will have $N$ different simpler classification constraints, one
for each of the $N$ possible settings of $\bar{Q}_t$. These classification
constraints are therefore



$$\int P(\Theta)\left(\sum_m Q_t(m) \log P(m,X_t|\Theta_+) + H(Q_t) - \log P(n,X_t|\Theta_-) + b - \gamma_t\right) d\Theta \;\geq\; 0, \qquad n = 1..N.$$

The additional constraints give rise to additional Lagrange multipliers 
which can create extra computational work but we will later outline 
various efficiencies to alleviate this problem. Effectively, the additional 
constraints ensure that the Jensen-bounded term for the correct model 
in the discriminant function is larger than each possible setting of the 
latent variable n = 1..N in the incorrect model. We can now perform 
the above manipulation on all t = 1..T constraints in the latent MED 
problem. 
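
The quality of the winner-takes-all approximation, and the reason a single constraint on the maximum can be replicated into $N$ constraints, can be sketched numerically. The values of $\log P(n, X_t|\Theta_-)$ and the scalar `c` below are hypothetical, and a fixed parameter setting is used in place of the expectation under $P(\Theta)$.

```python
import numpy as np

def log_sum_exp(v):
    """Numerically stable log of a sum of exponentials."""
    top = np.max(v)
    return top + np.log(np.sum(np.exp(v - top)))

log_joint = np.array([-1.2, -6.0, -9.5])     # assumed log P(n, X_t | Theta-), N = 3
N = log_joint.size
exact = log_sum_exp(log_joint)

# Unit-vector settings of the weights: each row selects one latent configuration n.
one_hot = np.eye(N)
per_config = one_hot @ log_joint             # entropy term is zero for one-hot weights
winner = per_config.max()
gap = exact - winner                         # always between 0 and log N

# A constraint against the maximum holds exactly when it holds for every configuration.
c = -1.0                                     # assumed value of the correct-class terms
single = (c - winner) >= 0
replicated = all((c - v) >= 0 for v in per_config)
```

When the components are dissimilar, as here, one term dominates the sum and the gap is small; in general the approximation error is at most $\log N$.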

Consider our training set $X_1,\ldots,X_T$ and the corresponding binary
class labels $y_1,\ldots,y_T$. This dataset can be split into $T_p$ positive exemplars
where $y_t = +1$ and $T_n$ negative exemplars where $y_t = -1$.
Whenever we have $y_t = +1$, we expand our $T_p$ positive classification
constraints as above to obtain $N \times T_p$ total classification constraints.
Similarly, whenever $y_t = -1$, we expand our $T_n$ negative classification
constraints to obtain $M \times T_n$ classification constraints. This gives the
following $N T_p + M T_n$ classification constraints. For $t \in T_p$:



$$\int P(\Theta)\left(\sum_m Q_t(m) \log P(m,X_t|\Theta_+) - \log P(n,X_t|\Theta_-) + H(Q_t) + b - \gamma_t\right) d\Theta \;\geq\; 0, \qquad n = 1..N.$$



For $t \in T_n$ where $y_t = -1$ we have the following constraints:

$$\int P(\Theta)\left(\sum_n \bar{Q}_t(n) \log P(n,X_t|\Theta_-) - \log P(m,X_t|\Theta_+) + H(\bar{Q}_t) - b - \gamma_t\right) d\Theta \;\geq\; 0, \qquad m = 1..M.$$

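For the sake of compactness, consider the following vectors $q_t$ and $\bar{q}_t$.
These are our posterior distributions over the latent variables under the
current distribution $P^{i}(\Theta)$

$$q_t(m) = \frac{\exp\left(\int P^{i}(\Theta) \log P(m,X_t|\Theta_+)\, d\Theta\right)}{\sum_{m'} \exp\left(\int P^{i}(\Theta) \log P(m',X_t|\Theta_+)\, d\Theta\right)}$$

$$\bar{q}_t(n) = \frac{\exp\left(\int P^{i}(\Theta) \log P(n,X_t|\Theta_-)\, d\Theta\right)}{\sum_{n'} \exp\left(\int P^{i}(\Theta) \log P(n',X_t|\Theta_-)\, d\Theta\right)} \qquad (5.2)$$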



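We can thus write the above constraints simultaneously as

$$\int P(\Theta)\, y_t\left(\sum_m Q_{tj}(m) \log P(m,X_t|\Theta_+) - \sum_n \bar{Q}_{tj}(n) \log P(n,X_t|\Theta_-) + b\right) d\Theta - \tilde{\gamma}_t \;\geq\; 0 \qquad \forall t, \forall j$$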

where $t$ iterates over each datum $1..T$ and $j$ iterates over $1..N$ if $y_t$ is
positive and over $1..M$ if $y_t$ is negative. Here we have also introduced
the $Q$ and $\bar{Q}$ posterior distributions over the latent variables for each of
the possible constraints. For instance, $Q$ is of size $(N T_p + M T_n) \times M$
and given by the following formula:




$$Q_{tj}(m) = \begin{cases} q_t(m) & \forall t \in T_p,\; j = 1..N \\ \delta(m = j) & \forall t \in T_n,\; j = 1..M \end{cases} \qquad (5.3)$$



where the delta function $\delta(m = j)$ is unity when the $m$ and $j$ indexes
are equal and is 0 otherwise. Meanwhile, the distribution $\bar{Q}$ is of size
$(N T_p + M T_n) \times N$ and given by




$$\bar{Q}_{tj}(n) = \begin{cases} \delta(n = j) & \forall t \in T_p,\; j = 1..N \\ \bar{q}_t(n) & \forall t \in T_n,\; j = 1..M. \end{cases} \qquad (5.4)$$



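Furthermore, we have also conveniently absorbed the entropy terms in
the classification constraints into our new updated margin scalars $\tilde{\gamma}_t$:

$$\tilde{\gamma}_t = \begin{cases} \gamma_t - H(q_t) & \forall t \in T_p \\ \gamma_t - H(\bar{q}_t) & \forall t \in T_n. \end{cases} \qquad (5.5)$$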
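
The bookkeeping in Equations 5.3 to 5.5 can be sketched as a small routine that expands per-datum posteriors into one row per constraint. This is only an illustration of the indexing under assumed inputs: the array layout, the function name `expand_constraints` and the example posteriors are hypothetical, and the posteriors are assumed strictly positive so that the entropies are finite.

```python
import numpy as np

def expand_constraints(y, q, q_bar, gamma):
    """Build Q (rows x M), Q_bar (rows x N) and updated margins, one row per (t, j)."""
    T, M = q.shape
    N = q_bar.shape[1]
    Q_rows, Qbar_rows, margins, index = [], [], [], []
    for t in range(T):
        if y[t] > 0:
            entropy = -np.sum(q[t] * np.log(q[t]))
            for j in range(N):                       # one constraint per negative component
                Q_rows.append(q[t])                  # Q_tj(m) = q_t(m)
                Qbar_rows.append(np.eye(N)[j])       # Q_bar_tj(n) = delta(n = j)
                margins.append(gamma[t] - entropy)   # margin minus H(q_t)
                index.append((t, j))
        else:
            entropy = -np.sum(q_bar[t] * np.log(q_bar[t]))
            for j in range(M):                       # one constraint per positive component
                Q_rows.append(np.eye(M)[j])          # Q_tj(m) = delta(m = j)
                Qbar_rows.append(q_bar[t])           # Q_bar_tj(n) = q_bar_t(n)
                margins.append(gamma[t] - entropy)   # margin minus H(q_bar_t)
                index.append((t, j))
    return np.array(Q_rows), np.array(Qbar_rows), np.array(margins), index

y = np.array([+1, -1, +1])                                   # T_p = 2, T_n = 1
q = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])           # T x M posteriors, M = 2
q_bar = np.array([[0.3, 0.3, 0.4], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]])  # T x N, N = 3
gamma = np.ones(3)

Q, Q_bar, margins, index = expand_constraints(y, q, q_bar, gamma)
```

In this example there are $N T_p + M T_n = 3 \cdot 2 + 2 \cdot 1 = 8$ rows, and every row of both arrays is a valid distribution.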



Given the above constraints, the MED solution distribution is (almost)
limited to a convex subset of the original convex hull $\mathcal{P}$ of constraints.
In this form, the MED projection step can be solved in closed form and
turns out to be an exponential family type of projection. Theorem 5.1
formalizes this closed-form update rule for the parameter distribution $P^{i+1}(\Theta)$.

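Theorem 5.1 The iterative update rule for the latent MED solution
distribution $P^{i+1}(\Theta)$ given the current setting of the latent distributions
$Q^{i}$ and $\bar{Q}^{i}$ is given by finding the $P^{i+1}(\Theta)$ that minimizes the divergence
$KL(P(\Theta)\|P^0(\Theta))$ to the prior $P^0(\Theta)$ subject to the constraints:

$$\int P(\Theta)\, y_t\left(\sum_m Q_{tj}(m) \log P(m,X_t|\Theta_+) - \sum_n \bar{Q}_{tj}(n) \log P(n,X_t|\Theta_-) + b\right) d\Theta - \tilde{\gamma}_t \;\geq\; 0 \qquad \forall t, \forall j$$

where $t$ iterates over each datum $1..T$ and $j$ iterates over $1..N$ if $y_t$ is
positive and over $1..M$ if $y_t$ is negative. The MED updated solution
$P^{i+1}(\Theta)$ is then set to the following:

$$P^{i+1}(\Theta) = \frac{P^0(\Theta)}{Z(\lambda)} \exp\left(\sum_{t,j} \lambda_{tj}\left[\, y_t\left(\sum_m Q_{tj}(m) \log P(m,X_t|\Theta_+) - \sum_n \bar{Q}_{tj}(n) \log P(n,X_t|\Theta_-) + b\right) - \tilde{\gamma}_t \right]\right)$$

where the partition function $Z(\lambda)$ is defined over a set of non-negative
Lagrange multipliers $\lambda_{tj}$, one per constraint ($\lambda_1,\ldots,\lambda_{N T_p + M T_n}$ in total),
which are given by the unique maximum of the concave objective function
$J(\lambda) = -\log Z(\lambda)$.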



Solving for the optimal Lagrange multipliers gives us our new estimate
of $P^{i+1}(\Theta)$, which we then use to update the $Q$ and $\bar{Q}$ distributions for
the next iteration. We thus have an iterative algorithm which alternates
between updates of the $P(\Theta)$ distribution and updates of the $Q$, $\bar{Q}$
distributions, as well as the margins $\tilde{\gamma}_t$ (for all $t$). The update rule for
the latent distributions $Q$ and $\bar{Q}$ is given by the following theorem.



Theorem 5.2 The iterative updates for the latent distributions $Q$, $\bar{Q}$
and margins $\tilde{\gamma}_t$ are given by the current setting of the MED solution
distribution $P^{i}(\Theta)$ as elaborated in Equation 5.2, Equation 5.3, Equation 5.4
and Equation 5.5 respectively.



Iterating and interleaving these update rules converges to a local minimum
where the solution $P(\Theta)$ is close to the prior and still satisfies the
classification constraints. This will not necessarily be the global
optimum, and the result depends on the pseudo-random initialization of $P(\Theta)$ or,
alternatively, the pseudo-random initialization of the $Q$, $\bar{Q}$ distributions.
To alleviate the local minima issue, one may consider deterministically
annealing the $Q$ and $\bar{Q}$ distributions by dividing all the terms being
exponentiated in Equation 5.2 by a scalar temperature value (not to be
confused with the number of observations $T$). This
temperature is initialized to a large positive value and slowly reduced
towards unity during iterations of the algorithm [185]. Alternatively,
one may use simulated annealing, possibly by randomizing entries in the
$Q$ and $\bar{Q}$ distributions. Note that, throughout such quasi-global optimization
heuristics, both latent distributions $Q$ and $\bar{Q}$ should always remain
valid distributions with non-negative entries that sum to unity for each
of the $N T_p + M T_n$ classification constraints.
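
The effect of the temperature on a latent posterior can be sketched as follows. The expected log-likelihood values and the cooling schedule below are hypothetical; the sketch only shows that dividing the exponentiated terms by a large temperature flattens the posterior, and that the posterior sharpens towards its unannealed form as the temperature is reduced to unity.

```python
import numpy as np

def annealed_posterior(expected_log_joint, temperature):
    """Normalized exponentials of expected log-likelihoods divided by a temperature."""
    scaled = np.asarray(expected_log_joint, dtype=float) / temperature
    scaled = scaled - scaled.max()           # stabilise the exponentials
    w = np.exp(scaled)
    return w / w.sum()

expected_log_joint = np.array([-2.0, -5.0, -9.0])   # assumed values of the exponentiated integrals
schedule = [50.0, 10.0, 2.0, 1.0]                   # hypothetical cooling schedule ending at unity

posteriors = [annealed_posterior(expected_log_joint, temp) for temp in schedule]
entropies = [-np.sum(p * np.log(p)) for p in posteriors]
```

Each annealed vector remains a valid distribution, and its entropy decreases as the temperature is lowered.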

In fact, we shall eventually see that it is not necessary to explicitly
optimize the full problem involving $N T_p + M T_n$ constraints. This is
because both the maximization of the objective function $-\log Z(\lambda)$ and
the MED solution distribution $P(\Theta)$ will involve additional efficiencies
where the Lagrange multipliers will be scaled versions of components of
the latent distributions $Q$ and $\bar{Q}$. Hence, they need not be optimized
directly in a computationally intensive manner and might even be stored
more efficiently since they inherit some of the factorization properties of
the posterior distributions on latent variables.

4. Latent Decision Rules 

Once the iterative algorithm converges and updates cease changing
the distributions, we have our final latent MED solution $P(\Theta)$. We
propose three possible ways of using this final $P(\Theta)$ distribution in a
decision rule to classify novel exemplars.

Definition 5.1 Given the final configuration of the iterative MED solution
$P(\Theta)$, the classification rule for a novel exemplar $X_{T+1}$ is found
by imputing either the mode of the distribution over parameters
$\Theta^* = \arg\max_\Theta P(\Theta)$ or the mean $\Theta^* = \int P(\Theta)\,\Theta\, d\Theta$ into the discriminant
function:



$$\hat{y}_{T+1} = \mathrm{sign}\left(\log \frac{\sum_m P(m,X_{T+1}|\Theta_+^*)}{\sum_n P(n,X_{T+1}|\Theta_-^*)} + b^*\right)$$



or by computing the following approximation of the discriminant func- 
tion with appropriate expectations on the terms within the summation: 



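$$\hat{y}_{T+1} = \mathrm{sign}\left(\log \frac{\sum_m \exp\left(\int P(\Theta) \log P(m,X_{T+1}|\Theta_+)\, d\Theta\right)}{\sum_n \exp\left(\int P(\Theta) \log P(n,X_{T+1}|\Theta_-)\, d\Theta\right)} + \int P(\Theta)\, b\, d\Theta\right).$$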

The above approximate solutions retain the efficiency of a maximum
likelihood based classifier during online usage in testing scenarios. More
elaborate decision rules are also feasible, including transductive variants
[94]. For transduction, we may consider using the new exemplar $X_{T+1}$ as
a data point in our training set and estimate its label as a hidden variable
with an augmented MED solution distribution of the form $P(\Theta, y_{T+1})$,
but this is beyond the scope of this chapter. We next elaborate in
detail the latent MED algorithm for a mixture of Gaussians, which,
surprisingly, reduces to the iterative solution of a support vector machine
optimization problem.

5. Large Margin Mixtures of Gaussians 

While it is possible to consider a mixture of arbitrary exponential
family distributions, we will assume that $\Theta_+$ is composed of $M$
Gaussians and $\Theta_-$ is composed of $N$ Gaussians. For the mixture of
Gaussians model, the positive class model expands as
$\Theta_+ = \{\alpha, \theta_1^+, \ldots, \theta_M^+\}$, which includes
the mixing proportions as well as the parameters for each Gaussian (our
emission distributions in the mixture). Meanwhile, the negative class
model is $\Theta_- = \{\beta, \theta_1^-, \ldots, \theta_N^-\}$, which
contains the negative class mixing proportions and Gaussian distribution
parameters. The Gaussians are $D$-dimensional continuous distributions in
the space of $X$ and we further simplify the problem by assuming the
Gaussians have identity covariance and only their means are variable.
Therefore, for any positive Gaussian, we have
$P(X|\theta_m^+) = N(X|\theta_m^+, I)$ and for any negative Gaussian we
have $P(X|\theta_n^-) = N(X|\theta_n^-, I)$. To compute MED projections,
we need prior distributions over the parameters. For simplicity, we will
choose a Gaussian prior on all the mean parameters $\theta_m^+$ and
$\theta_n^-$, namely a white noise distribution with zero mean and
identity covariance. For all $m = 1..M$ positive class priors and all
$n = 1..N$ negative class priors we therefore have:

$$P^0(\theta_m^+) = N(\theta_m^+ \mid 0, I) \qquad P^0(\theta_n^-) = N(\theta_n^- \mid 0, I).$$



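Similarly, we need priors for the multinomial mixing proportions $\alpha$
and $\beta$, which are given by the conjugate Dirichlet distributions:

$$P^0(\alpha) = \frac{\Gamma(\sum_m \phi_m)}{\prod_m \Gamma(\phi_m)} \prod_m \alpha_m^{\phi_m - 1} \qquad P^0(\beta) = \frac{\Gamma(\sum_n \psi_n)}{\prod_n \Gamma(\psi_n)} \prod_n \beta_n^{\psi_n - 1}.$$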



The prior for the bias is also a zero-mean Gaussian with variance $\sigma$:

$$P^0(b) = N(b \mid 0, \sigma) = \frac{1}{\sqrt{2\pi\sigma}}\, e^{-\frac{1}{2\sigma} b^2}.$$



Recall that our discriminant function is the log ratio of these mixtures
of Gaussians:

$$\mathcal{L}(X_t; \Theta) = \log \frac{\sum_{m=1}^M \alpha_m N(X_t \mid \theta_m^+)}{\sum_{n=1}^N \beta_n N(X_t \mid \theta_n^-)} + b.$$
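As a minimal sketch of this discriminant, assuming identity-covariance Gaussians and point estimates for all parameters, the log ratio can be evaluated as below. The function names and the toy means are hypothetical illustrations, not part of the original text.

```python
import numpy as np

def log_gaussian_identity(x, mean):
    """Log density of N(x | mean, I) for a D-dimensional point x."""
    d = x.shape[0]
    diff = x - mean
    return -0.5 * d * np.log(2.0 * np.pi) - 0.5 * diff @ diff

def log_mixture(x, weights, means):
    """Numerically stable log sum_k weights[k] * N(x | means[k], I)."""
    terms = np.array([np.log(w) + log_gaussian_identity(x, mu)
                      for w, mu in zip(weights, means)])
    top = terms.max()
    return top + np.log(np.exp(terms - top).sum())

def discriminant(x, alpha, pos_means, beta, neg_means, b):
    """Log ratio of the positive and negative mixtures plus the bias b."""
    return (log_mixture(x, alpha, pos_means)
            - log_mixture(x, beta, neg_means) + b)

# Toy values (illustrative only): two components per class.
alpha = np.array([0.5, 0.5])
beta = np.array([0.5, 0.5])
pos_means = np.array([[0.0, 3.0], [0.0, -1.0]])
neg_means = np.array([[0.0, 1.0], [0.0, -3.0]])
x_pos = np.array([0.0, 3.0])
label = int(np.sign(discriminant(x_pos, alpha, pos_means,
                                 beta, neg_means, b=0.0)))
```

The predicted label is the sign of the discriminant; the log-sum-exp form avoids underflow when a point is far from every mean.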



We now assume that we are given a current estimate of Q, Q and $\gamma$,
and wish to update the parameter distribution $P^{i+1}(\Theta)$. The
following details the implementation of the update. This will estimate an
updated MED distribution over $M$ positive Gaussian models, $N$
negative-class Gaussians, two Dirichlet distributions (for the positive
and negative class mixing proportions) as well as a Gaussian over the
scalar bias parameter.



5.1 Parameter Distribution Update 

We now explicate the first update rule for improving our distribution
over parameters $P^i(\Theta, \gamma)$ given the current setting of Q, Q,
$\gamma$, namely the distributions over latent variables and the margins.
The solution to this MED update is given by maximizing the negated
logarithm of the partition function:

$$Z(\lambda) = \int P^0(\Theta)\, \exp\left( \sum_{tj} \lambda_{tj} \left[ y_t \left( \sum_m Q_{tj}(m) \log \alpha_m P(X_t|\theta_m^+) - \sum_n Q_{tj}(n) \log \beta_n P(X_t|\theta_n^-) + b \right) - \gamma_t \right] \right) d\Theta.$$



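Expanding the integral, we note that it can be written as

$$Z(\lambda) = Z_{\Theta_+}(\lambda)\, Z_{\Theta_-}(\lambda)\, Z_b(\lambda)\, e^{-\sum_{tj} \lambda_{tj} \gamma_t}.$$

The integral involving the bias term is straightforward to solve by
integrating over the zero-mean Gaussian prior on the bias:

$$Z_b(\lambda) = \int_b \frac{1}{\sqrt{2\pi\sigma}}\, e^{-\frac{1}{2\sigma} b^2}\, e^{\sum_{tj} \lambda_{tj} y_t b}\, db = e^{\frac{\sigma}{2} \left( \sum_{tj} \lambda_{tj} y_t \right)^2}.$$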
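The closed form of the bias integral is the moment generating function of a zero-mean Gaussian with variance $\sigma$. As a rough numerical check, assuming the scalar $s = \sum_{tj} \lambda_{tj} y_t$ is given, the sketch below compares it with a Riemann-sum approximation. The function names and the values of `s` and `sigma` are hypothetical.

```python
import numpy as np

def bias_partition_closed_form(s, sigma):
    """exp(sigma/2 * s^2), where s = sum_tj lambda_tj * y_t."""
    return np.exp(0.5 * sigma * s ** 2)

def bias_partition_numeric(s, sigma, half_width=40.0, n=400001):
    """Riemann-sum approximation of int N(b | 0, sigma) exp(s * b) db."""
    b = np.linspace(-half_width, half_width, n)
    prior = np.exp(-b ** 2 / (2.0 * sigma)) / np.sqrt(2.0 * np.pi * sigma)
    db = b[1] - b[0]
    return float(np.sum(prior * np.exp(s * b)) * db)

s, sigma = 0.7, 2.0
closed = bias_partition_closed_form(s, sigma)
numeric = bias_partition_numeric(s, sigma)
```

Because $-\log Z_b$ equals $-\frac{\sigma}{2} s^2$, a large prior variance penalizes any nonzero $s$ heavily.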

Moving to model parameter distributions, it suffices to show how to
compute $Z_{\Theta_+}(\lambda)$, since the derivations for
$Z_{\Theta_-}(\lambda)$ are equivalent and symmetric. We see that each
positive and negative model component of the partition function
factorizes into

$$Z_{\Theta_+}(\lambda) = Z_\alpha(\lambda) \prod_{m=1}^M Z_{\theta_m^+}(\lambda) \qquad Z_{\Theta_-}(\lambda) = Z_\beta(\lambda) \prod_{n=1}^N Z_{\theta_n^-}(\lambda)$$

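where, more specifically:

$$\begin{aligned}
Z_\alpha(\lambda) &= \int_\alpha P^0(\alpha)\, e^{\sum_{tj} \lambda_{tj} y_t \sum_m Q_{tj}(m) \log \alpha_m}\, d\alpha \\
&= \frac{\Gamma(\sum_m \phi_m)}{\prod_m \Gamma(\phi_m)} \int_\alpha \prod_{m=1}^M \alpha_m^{\phi_m - 1 + \sum_{tj} \lambda_{tj} y_t Q_{tj}(m)}\, d\alpha \\
&= \frac{\Gamma(\sum_m \phi_m)\, \prod_m \Gamma\left(\phi_m + \sum_{tj} \lambda_{tj} y_t Q_{tj}(m)\right)}{\prod_m \Gamma(\phi_m)\; \Gamma\left(\sum_m \phi_m + \sum_{tjm} \lambda_{tj} y_t Q_{tj}(m)\right)}.
\end{aligned}$$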



The above integrals were solved merely by noting the normalization
property of the Dirichlet distribution. By symmetry, we have the
following for the negative mixing proportion model's partition function:

$$Z_\beta(\lambda) = \frac{\Gamma(\sum_n \psi_n)\, \prod_n \Gamma\left(\psi_n - \sum_{tj} \lambda_{tj} y_t Q_{tj}(n)\right)}{\prod_n \Gamma(\psi_n)\; \Gamma\left(\sum_n \psi_n - \sum_{tjn} \lambda_{tj} y_t Q_{tj}(n)\right)}.$$

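To obtain the contribution of the positive Gaussian mean parameters we
have:

$$\begin{aligned}
\prod_{m=1}^M Z_{\theta_m^+}(\lambda) &= \prod_{m=1}^M \int_{\theta_m^+} P^0(\theta_m^+)\, e^{\sum_{tj} \lambda_{tj} y_t Q_{tj}(m) \log N(X_t|\theta_m^+)}\, d\theta_m^+ \\
&= \prod_{m=1}^M (2\pi)^{-\frac{D}{2} \sum_{tj} \lambda_{tj} y_t Q_{tj}(m)} \int_{\theta_m^+} P^0(\theta_m^+)\, e^{-\frac{1}{2} \sum_{tj} \lambda_{tj} y_t Q_{tj}(m) \|X_t - \theta_m^+\|^2}\, d\theta_m^+.
\end{aligned}$$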



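At this point, we focus on the integral above. It has the following
general form and can be easily simplified by completing the square:

$$\begin{aligned}
\int_\theta P^0(\theta)\, e^{-\frac{1}{2} \sum_t w_t \|X_t - \theta\|^2}\, d\theta
&= \int_\theta \frac{1}{(2\pi)^{D/2}}\, e^{-\frac{1}{2} \theta^T \theta - \frac{1}{2} \sum_t w_t \|X_t - \theta\|^2}\, d\theta \\
&= \Big(1 + \sum_t w_t\Big)^{-D/2} \exp\left( -\frac{1}{2} \sum_t w_t X_t^T X_t + \frac{1}{2}\, \frac{\sum_{t t'} w_t w_{t'} X_t^T X_{t'}}{1 + \sum_t w_t} \right).
\end{aligned}$$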
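The completed-square result can be sanity-checked numerically. The sketch below is a one-dimensional check under the stated white Gaussian prior, assuming $1 + \sum_t w_t > 0$ so that the integral converges; the helper names and the weights are hypothetical.

```python
import numpy as np

def completed_square(w, X):
    """Closed form of int N(theta|0,I) exp(-0.5 sum_t w_t ||X_t - theta||^2).

    w has length T and X has shape (T, D); requires 1 + sum(w) > 0.
    """
    w = np.asarray(w, dtype=float)
    X = np.asarray(X, dtype=float)
    d = X.shape[1]
    total = 1.0 + w.sum()
    s = w @ X                                  # sum_t w_t X_t
    quad = np.sum(w * np.sum(X * X, axis=1))   # sum_t w_t X_t^T X_t
    return total ** (-0.5 * d) * np.exp(-0.5 * quad + 0.5 * (s @ s) / total)

def numeric_1d(w, x, half_width=30.0, n=600001):
    """Riemann-sum approximation of the same integral for D = 1."""
    theta = np.linspace(-half_width, half_width, n)
    prior = np.exp(-0.5 * theta ** 2) / np.sqrt(2.0 * np.pi)
    expo = -0.5 * sum(wt * (xt - theta) ** 2 for wt, xt in zip(w, x))
    return float(np.sum(prior * np.exp(expo)) * (theta[1] - theta[0]))

w = np.array([0.4, -0.3, 0.2])   # weights may be negative, as with lambda*y*Q
x = np.array([1.0, -0.5, 2.0])
closed = completed_square(w, x[:, None])
numeric = numeric_1d(w, x)
```

The weights play the role of $\lambda_{tj} y_t Q_{tj}(m)$, which is why the condition $1 + \sum_t w_t > 0$ reappears later as a barrier on the Lagrange multipliers.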



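Inserting the completed integral into the partition function formula
yields

$$\prod_m Z_{\theta_m^+}(\lambda) = \prod_m (2\pi)^{-\frac{D}{2} \sum_{tj} \lambda_{tj} y_t Q_{tj}(m)} \Big(1 + \sum_{tj} \lambda_{tj} y_t Q_{tj}(m)\Big)^{-D/2} \exp\left( -\frac{1}{2} \sum_{tj} \lambda_{tj} y_t Q_{tj}(m) X_t^T X_t + \frac{1}{2}\, \frac{\sum_{tjt'j'} \lambda_{tj} y_t Q_{tj}(m)\, \lambda_{t'j'} y_{t'} Q_{t'j'}(m)\, X_t^T X_{t'}}{1 + \sum_{tj} \lambda_{tj} y_t Q_{tj}(m)} \right).$$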



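Similarly, we have the following contribution to the partition function
from the negative Gaussian models:

$$\prod_n Z_{\theta_n^-}(\lambda) = \prod_n (2\pi)^{\frac{D}{2} \sum_{tj} \lambda_{tj} y_t Q_{tj}(n)} \Big(1 - \sum_{tj} \lambda_{tj} y_t Q_{tj}(n)\Big)^{-D/2} \exp\left( \frac{1}{2} \sum_{tj} \lambda_{tj} y_t Q_{tj}(n) X_t^T X_t + \frac{1}{2}\, \frac{\sum_{tjt'j'} \lambda_{tj} y_t Q_{tj}(n)\, \lambda_{t'j'} y_{t'} Q_{t'j'}(n)\, X_t^T X_{t'}}{1 - \sum_{tj} \lambda_{tj} y_t Q_{tj}(n)} \right).$$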



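All the above can be grouped into one large concave objective function
$J(\lambda) = -\log Z(\lambda)$. It is important not to despair since this
large objective function will soon simplify drastically into a mere SVM
equation when we note some important constraints. The objective function
is given by:

$$\begin{aligned}
J(\lambda) = {} & \sum_{tj} \lambda_{tj} \gamma_t - \frac{\sigma}{2} \Big( \sum_{tj} \lambda_{tj} y_t \Big)^2 \\
& + \sum_m \log \Gamma(\phi_m) - \log \Gamma\Big(\sum_m \phi_m\Big) + \sum_n \log \Gamma(\psi_n) - \log \Gamma\Big(\sum_n \psi_n\Big) \\
& + \log \Gamma\Big(\sum_m \phi_m + \sum_{tjm} \lambda_{tj} y_t Q_{tj}(m)\Big) - \sum_m \log \Gamma\Big(\phi_m + \sum_{tj} \lambda_{tj} y_t Q_{tj}(m)\Big) \\
& + \log \Gamma\Big(\sum_n \psi_n - \sum_{tjn} \lambda_{tj} y_t Q_{tj}(n)\Big) - \sum_n \log \Gamma\Big(\psi_n - \sum_{tj} \lambda_{tj} y_t Q_{tj}(n)\Big) \\
& + \frac{D}{2} \sum_{tjm} \lambda_{tj} y_t Q_{tj}(m) \log(2\pi) + \frac{D}{2} \sum_m \log\Big(1 + \sum_{tj} \lambda_{tj} y_t Q_{tj}(m)\Big) \\
& + \frac{1}{2} \sum_{tjm} \lambda_{tj} y_t Q_{tj}(m) X_t^T X_t - \frac{1}{2} \sum_m \frac{\sum_{tjt'j'} \lambda_{tj} y_t Q_{tj}(m)\, \lambda_{t'j'} y_{t'} Q_{t'j'}(m)\, X_t^T X_{t'}}{1 + \sum_{tj} \lambda_{tj} y_t Q_{tj}(m)} \\
& - \frac{D}{2} \sum_{tjn} \lambda_{tj} y_t Q_{tj}(n) \log(2\pi) + \frac{D}{2} \sum_n \log\Big(1 - \sum_{tj} \lambda_{tj} y_t Q_{tj}(n)\Big) \\
& - \frac{1}{2} \sum_{tjn} \lambda_{tj} y_t Q_{tj}(n) X_t^T X_t - \frac{1}{2} \sum_n \frac{\sum_{tjt'j'} \lambda_{tj} y_t Q_{tj}(n)\, \lambda_{t'j'} y_{t'} Q_{t'j'}(n)\, X_t^T X_{t'}}{1 - \sum_{tj} \lambda_{tj} y_t Q_{tj}(n)}.
\end{aligned}$$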



Not all settings of the Lagrange multipliers are permitted in the
objective function $J(\lambda)$ since it diverges for certain
configurations. For instance, we note the barrier functions

$$\sum_{tj} \lambda_{tj} y_t Q_{tj}(m) > -1 \quad m = 1..M \qquad \sum_{tj} \lambda_{tj} y_t Q_{tj}(n) < 1 \quad n = 1..N.$$

Furthermore, we note the following constraints as well:

$$\sum_{tj} \lambda_{tj} y_t Q_{tj}(m) > -\phi_m \quad m = 1..M \qquad \sum_{tj} \lambda_{tj} y_t Q_{tj}(n) < \psi_n \quad n = 1..N.$$

We next explicate some useful choices for the prior distributions. These
lead to important simplifications through the barrier functions above and
make the solution to $J(\lambda)$ reduce to a simple support vector
machine algorithm.

It is certainly possible to optimize the above $J(\lambda)$ function in
general for any choice of priors via standard Newton-Raphson methods or
axis-parallel techniques. However, these are computationally less
favorable than the non-informative priors, which will give an elegant
SVM-like solution. For completeness, we point out that an axis-parallel
method is feasible if a single $\lambda_{tj}$ is selected and its setting
is updated to maximize $J(\lambda)$ via bisection search or Brent's
method. The updates of each $\lambda_{tj}$ are performed sequentially
until the objective function $J(\lambda)$ ceases to increase. Which
$\lambda_{tj}$ axis to update next can either be chosen randomly or
according to some heuristic scheme as explained in the Appendix. We
typically initialize all $\lambda_{tj} = 0$ during the first iteration of
the two step algorithm and, during subsequent iterations, we use the
previous converged value of $\lambda$. Since the Q, Q distributions change
from one iteration to the next, it is possible that the previously
converged $\lambda$ result will violate some barrier functions. This
problem is alleviated by scaling down all $\lambda_{tj}$ by a single
constant factor to bring them within the required inequalities. This
ensures that the axis-parallel optimization begins in a valid portion of
the solution space.
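The rescaling step can be sketched as follows, assuming the multipliers are stored as one flat array with one entry per $(t, j)$ pair. The function `scale_to_feasible`, its arguments and the safety factor `shrink` are hypothetical; the text only states that a single constant factor is used.

```python
import numpy as np

def scale_to_feasible(lam, y, Q_pos, Q_neg, phi, psi, shrink=0.9):
    """Scale all multipliers by one constant so the barriers hold.

    lam, y   : length-K arrays, one entry per (t, j) pair (y holds y_t)
    Q_pos    : K x M array of Q_tj(m);  Q_neg : K x N array of Q_tj(n)
    phi, psi : Dirichlet hyperparameters (lengths M and N)
    """
    v = lam * y
    pos = v @ Q_pos                  # each must exceed -min(1, phi_m)
    neg = v @ Q_neg                  # each must stay below min(1, psi_n)
    lower = -np.minimum(1.0, phi)
    upper = np.minimum(1.0, psi)
    scale = 1.0
    for val, lim in zip(pos, lower):
        if val <= lim:               # violated: shrink until strictly inside
            scale = min(scale, shrink * lim / val)
    for val, lim in zip(neg, upper):
        if val >= lim:
            scale = min(scale, shrink * lim / val)
    return scale * lam

lam = np.array([5.0, 1.0, 4.0])
y = np.array([1.0, 1.0, -1.0])
Q_pos = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
Q_neg = np.ones((3, 1))
phi = np.array([0.5, 0.5])
psi = np.array([0.5])
lam_ok = scale_to_feasible(lam, y, Q_pos, Q_neg, phi, psi)
```

Since $\lambda = 0$ satisfies every inequality strictly and the constraints are linear in $\lambda$, shrinking toward zero can only restore feasibility.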

5.2 Just a Support Vector Machine 

We now show that the above optimization is just a support vector machine
learning problem and can be readily solved via standard quadratic
programming methods. This interesting equivalence emerges in our
framework if we select non-informative priors for the bias and the mixing
proportions. For example, selecting a non-informative prior for the bias
distribution is equivalent to taking $\sigma \to \infty$ as was shown in
Chapter 3. Furthermore, we will assume non-informative priors for the
various Dirichlet distributions by choosing $\phi_m \to 0$ and
$\psi_n \to 0$.

First note that the non-informative bias prior with $\sigma \to \infty$
gives rise to the constraint $\sum_{tj} \lambda_{tj} y_t = 0$ as in
Chapter 3. For clarity, define the vector $v$ whose $(t,j)$'th entry is
given by $v(t,j) = \lambda_{tj} y_t$. We can write the constraint above as
an inner product of that vector with the vector of all ones as
$v^T 1 = 0$.

Second, under a non-informative prior for the Dirichlet distributions,
we have the inequalities $\sum_{tj} \lambda_{tj} y_t Q_{tj}(m) \ge 0$ and
$\sum_{tj} \lambda_{tj} y_t Q_{tj}(n) \le 0$. Let us denote the
distributions over latent variables in vector form as $q_m$ and $q_n$.
These vectors have scalar entries which are given by
$q_m(t,j) = Q_{tj}(m)$ and $q_n(t,j) = Q_{tj}(n)$. Thus, we have the
following constraints in this compact notation: $q_m^T v \ge 0$ for all
$m$ and $q_n^T v \le 0$ for all $n$. It is also clear that summing the
vector $q_m$ over $m$ produces the ones vector, in other words:

$$\sum_m q_m = 1.$$

Taking the inner product of both sides of the equality above with $v$ we
obtain:

$$\sum_m v^T q_m = v^T 1 = 0.$$

However, each term in the summation over $m$ in the left hand side above
must be non-negative, since the non-informative Dirichlet priors
previously forced $q_m^T v \ge 0$. Thus, we see that all non-negative
terms in the left hand side above must be zero under non-informative
priors. A similar argument applies to the negative class, using
$\sum_n q_n = 1$. We now have the strict equalities:

$$\sum_{tj} \lambda_{tj} y_t Q_{tj}(m) = 0 \quad m = 1..M$$

$$\sum_{tj} \lambda_{tj} y_t Q_{tj}(n) = 0 \quad n = 1..N.$$

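Inserting these constraints from the non-informative prior simplifies our
objective function $J(\lambda)$ drastically. Rewriting it and removing
constant terms, we obtain

$$J(\lambda) = \sum_{tj} \lambda_{tj} \gamma_t - \frac{1}{2} \sum_{tjt'j'} y_t y_{t'} \lambda_{tj} \lambda_{t'j'} \sum_m Q_{tj}(m) Q_{t'j'}(m)\, X_t^T X_{t'} - \frac{1}{2} \sum_{tjt'j'} y_t y_{t'} \lambda_{tj} \lambda_{t'j'} \sum_n Q_{tj}(n) Q_{t'j'}(n)\, X_t^T X_{t'}.$$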

This is merely a quadratic program, and $J(\lambda)$ can be maximized just
like the dual of a traditional support vector machine. For non-separable
problems, we can maximize the above $J(\lambda)$ subject to the box
constraints on the Lagrange multipliers $\lambda_{tj} \in [0, c]$. This
would formally arise through barrier functions had we assumed an
exponential prior on margins and integrated over it as in previous
chapters. Furthermore, for all $m = 1..M$ and $n = 1..N$, we have the
following additional linear constraints on the Lagrange multipliers in
the quadratic program: $\sum_{tj} \lambda_{tj} y_t Q_{tj}(m) = 0$ and
$\sum_{tj} \lambda_{tj} y_t Q_{tj}(n) = 0$. These obviate the need for the
constraint $\sum_{tj} \lambda_{tj} y_t = 0$. Note that our rather
complicated MED objective function now looks like a simple quadratic
program, very reminiscent of the support vector machine learning
algorithm. One interesting difference within the dual objective function
is the additional expected likelihood terms [87] computed from the latent
distributions, $\sum_m Q_{tj}(m) Q_{t'j'}(m)$ and
$\sum_n Q_{tj}(n) Q_{t'j'}(n)$, which modify the inner products
$X_t^T X_{t'}$ of the Gram matrix. Furthermore, in the case where the
numbers of components in the mixture models, $M$ and $N$, are both one,
the above equation and framework reduce to almost exactly a support
vector machine.
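To make the structure of the quadratic term concrete, the sketch below assembles the matrix that multiplies $\lambda \lambda^T$ in the simplified objective, assuming the latent distributions are stored as arrays with one row per $(t, j)$ pair. The names `latent_gram`, `dual_objective` and `pair_to_datum` are hypothetical. With $M = N = 1$ every latent factor equals one, so the matrix is twice the usual SVM Gram matrix.

```python
import numpy as np

def latent_gram(X, y, Q_pos, Q_neg, pair_to_datum):
    """H[(t,j),(t',j')] = y_t y_t' (sum_m QQ + sum_n QQ) X_t^T X_t'."""
    Xk = X[pair_to_datum]            # one data row per (t, j) pair
    yk = y[pair_to_datum]
    latent = Q_pos @ Q_pos.T + Q_neg @ Q_neg.T
    return np.outer(yk, yk) * latent * (Xk @ Xk.T)

def dual_objective(lam, gamma_k, H):
    """J(lambda) = sum_k lambda_k gamma_k - 0.5 lambda^T H lambda."""
    return lam @ gamma_k - 0.5 * lam @ H @ lam

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
ones = np.ones((5, 1))               # M = N = 1: a single latent state
H = latent_gram(X, y, ones, ones, np.arange(5))
```

Any standard QP solver could then maximize `dual_objective` under the box and linear equality constraints; that step is not shown here.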

5.3 Latent Distributions Update 

While the update rule for the latent distributions Q, Q and the margins
$\gamma$ is straightforward given Equation 5.2, Equation 5.3, Equation 5.4
and Equation 5.5, we still need to be able to compute the desired
expectations using our current model for the mixture of Gaussians case.
These formulae are explicated in the next section and can be immediately
plugged into the update rules for Q and Q. Recall that we needed to
compute quantities such as the following for updating the latent
distributions:

$$Q_{tj}(m) \propto \exp\left( \int P^i(\Theta) \log \alpha_m P(X_t|\theta_m^+)\, d\Theta \right) \quad \forall t \in T_p,\ m = 1..M.$$
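Given the expected log terms for each component, this update is a normalized exponential over the component index. A minimal sketch, assuming the expectations have already been collected into a matrix with one row per datum and one column per component (the function name is hypothetical):

```python
import numpy as np

def latent_posterior(expected_log_joint):
    """Normalize exp(E[log alpha_m P(X_t | theta_m)]) over m, row by row.

    expected_log_joint has shape (T, M); each returned row sums to one.
    """
    shifted = expected_log_joint - expected_log_joint.max(axis=1, keepdims=True)
    q = np.exp(shifted)              # subtracting the row max avoids overflow
    return q / q.sum(axis=1, keepdims=True)

scores = np.array([[0.0, np.log(3.0)],
                   [-1000.0, -1001.0]])
q = latent_posterior(scores)
```

Any additive constant shared by all components of a row cancels in the normalization.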



In this section, we elaborate the required expectations of the
$\log \alpha_m$ mixing proportions, the $\log P(X_t|\theta_m^+)$
Gaussians, the $\log \beta_n$ mixing proportions, and the
$\log P(X_t|\theta_n^-)$ Gaussians. Furthermore, we need expectations
involving the bias to eventually recover the discriminant function for
use in the decision rules outlined in Section 4. These expectations
require us to compute the updated $P^{i+1}(\Theta)$ using the Lagrange
multipliers by optimizing $J(\lambda)$. However, we should note that for a
non-informative prior, some distributions and expectations must actually
be computed indirectly via the Karush-Kuhn-Tucker (KKT) conditions.

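Assume we have found the current maximizer $\lambda^*$ of $J(\lambda)$ for
our quadratic program in Section 5.2. For compactness, we will omit the
symbol $*$ and simply refer to our current optimal setting of the
Lagrange multipliers as $\lambda$. Using this optimal setting, it is now
straightforward to compute a probability distribution over the models,
i.e. $P(\Theta)$, as follows:

$$P^{i+1}(\Theta) = \frac{1}{Z(\lambda)}\, P^0(\Theta)\, \exp\left( \sum_{tj} \lambda_{tj} \left[ y_t \left( \sum_{m=1}^M Q_{tj}(m) \log \alpha_m P(X_t|\theta_m^+) - \sum_{n=1}^N Q_{tj}(n) \log \beta_n P(X_t|\theta_n^-) + b \right) - \gamma_t \right] \right).$$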

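The distribution over model parameters is perturbed from its prior
setting by the non-zero Lagrange multipliers in $\lambda$. The Gaussians
forming the positive class model are updated from their prior (white
Gaussian) configurations, giving $P^{i+1}(\theta_m^+)$ for each
$m = 1..M$:

$$P^{i+1}(\theta_m^+) = \frac{1}{Z_{\theta_m^+}(\lambda)}\, P^0(\theta_m^+)\, e^{\sum_{tj} \lambda_{tj} y_t Q_{tj}(m) \log N(X_t|\theta_m^+)} = N\Big( \theta_m^+ \,\Big|\, \sum_{tj} \lambda_{tj} y_t Q_{tj}(m) X_t,\ I \Big).$$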









Here we are exploiting the fact that the SVM solution inherited the
constraints $\sum_{tj} \lambda_{tj} y_t Q_{tj}(m) = 0$. Similarly, the
updated Gaussians for the negative models $P^{i+1}(\theta_n^-)$ for each
$n = 1..N$ are given by

$$P^{i+1}(\theta_n^-) = N\Big( \theta_n^- \,\Big|\, -\sum_{tj} \lambda_{tj} y_t Q_{tj}(n) X_t,\ I \Big).$$



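Here we again exploited the constraints
$\sum_{tj} \lambda_{tj} y_t Q_{tj}(n) = 0$ to simplify the formula. We can
use these Gaussian distributions to compute some of the required
expectations. For instance, we can find the expectation of
$\log N(X_t|\theta_m^+)$ under the currently estimated Gaussian
distribution $P^{i+1}(\theta_m^+)$:

$$E\{\log N(X_t|\theta_m^+)\} = -\frac{D}{2} \log(2\pi) - \frac{1}{2} X_t^T X_t + E\{X_t^T \theta_m^+\} - \frac{1}{2} E\{(\theta_m^+)^T \theta_m^+\}.$$

The first sub-component of the expectation is

$$E\{X_t^T \theta_m^+\} = \sum_{\tau j} \lambda_{\tau j} y_\tau Q_{\tau j}(m)\, X_\tau^T X_t.$$

The second term is the trace of the correlation under the current
estimated Gaussian model distribution, where the constant $D$ is the
trace of the identity covariance:

$$E\{(\theta_m^+)^T \theta_m^+\} = D + \sum_{\tau j \tau' j'} \lambda_{\tau j} y_\tau Q_{\tau j}(m)\, \lambda_{\tau' j'} y_{\tau'} Q_{\tau' j'}(m)\, X_\tau^T X_{\tau'}.$$

Combining both terms gives the expectation of the log Gaussian for
positive models in detail for $m = 1..M$:

$$E\{\log N(X_t|\theta_m^+)\} = -\frac{D}{2} \log(2\pi) - \frac{D}{2} - \frac{1}{2} X_t^T X_t + \sum_{\tau j} \lambda_{\tau j} y_\tau Q_{\tau j}(m)\, X_\tau^T X_t - \frac{1}{2} \sum_{\tau j \tau' j'} \lambda_{\tau j} y_\tau Q_{\tau j}(m)\, \lambda_{\tau' j'} y_{\tau'} Q_{\tau' j'}(m)\, X_\tau^T X_{\tau'}.$$

Similar derivations yield the expectation of the log Gaussian for the
negative models $n = 1..N$:

$$E\{\log N(X_t|\theta_n^-)\} = -\frac{D}{2} \log(2\pi) - \frac{D}{2} - \frac{1}{2} X_t^T X_t - \sum_{\tau j} \lambda_{\tau j} y_\tau Q_{\tau j}(n)\, X_\tau^T X_t - \frac{1}{2} \sum_{\tau j \tau' j'} \lambda_{\tau j} y_\tau Q_{\tau j}(n)\, \lambda_{\tau' j'} y_{\tau'} Q_{\tau' j'}(n)\, X_\tau^T X_{\tau'}.$$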
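These expectations only require the mean of the updated Gaussian. The sketch below assumes the mean parameter is distributed as $N(\mu, I)$ in $D$ dimensions, so that the expected squared norm is $D + \mu^T \mu$; the function names and toy inputs are hypothetical.

```python
import numpy as np

def posterior_mean(lam, y, Q_col, Xk, sign=1.0):
    """sign * sum_k lam_k y_k Q_k X_k (sign=-1 for negative-class models)."""
    return sign * (lam * y * Q_col) @ Xk

def expected_log_gaussian(x, mean):
    """E[log N(x | theta, I)] when theta ~ N(mean, I) in D dimensions."""
    d = x.shape[0]
    return (-0.5 * d * np.log(2.0 * np.pi) - 0.5 * x @ x
            + x @ mean - 0.5 * (d + mean @ mean))

# Toy multipliers, labels, latent weights and data (one row per (t, j) pair).
lam = np.array([1.0, 2.0])
y = np.array([1.0, -1.0])
Q_col = np.array([0.5, 0.25])
Xk = np.array([[2.0, 0.0], [0.0, 4.0]])
mu = posterior_mean(lam, y, Q_col, Xk)

x = np.array([1.0, 2.0])
value = expected_log_gaussian(x, mu)
```

Only inner products between data points enter these quantities, since `mu` is itself a weighted sum of the rows of `Xk`.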

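Next we would like to compute the updated distributions over mixing
proportions, namely $P^{i+1}(\alpha)$ and $P^{i+1}(\beta)$. Consider the
standard MED solution for updating a Dirichlet distribution:

$$\begin{aligned}
P^{i+1}(\alpha) &\propto P^0(\alpha)\, e^{\sum_{tj} \lambda_{tj} y_t \sum_m Q_{tj}(m) \log \alpha_m} \\
P^{i+1}(\alpha) &= \frac{\Gamma\left(\sum_m \left[\phi_m + \sum_{tj} \lambda_{tj} y_t Q_{tj}(m)\right]\right)}{\prod_m \Gamma\left(\phi_m + \sum_{tj} \lambda_{tj} y_t Q_{tj}(m)\right)} \prod_m \alpha_m^{\phi_m + \sum_{tj} \lambda_{tj} y_t Q_{tj}(m) - 1}.
\end{aligned}$$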

Note that the constraint $\sum_{tj} \lambda_{tj} y_t Q_{tj}(m) = 0$
misleadingly indicates that we do not update the distribution
$P^{i+1}(\alpha)$ from its prior configuration. It turns out that it is
not appropriate to update distributions over mixing proportions in this
way; we must instead directly compute the required expectations. The same
holds for the update of the Dirichlet distribution of the negative class'
mixing proportions and the update for the bias distribution. These
distributions actually do not remain stuck at their prior settings. They
are instead specified (along with expectations that utilize them) via the
Karush-Kuhn-Tucker conditions, since we assumed non-informative priors
for all of them.

Recall the constraints on the discriminant function that emerged during
a step of the iterative latent MED update rule:

$$\int P(\Theta) \left[ y_t \left( \sum_m Q_{tj}(m) \log \alpha_m P(X_t|\theta_m^+) - \sum_n Q_{tj}(n) \log \beta_n P(X_t|\theta_n^-) + b \right) - \gamma_t \right] d\Theta \ge 0 \quad \forall t, \forall j.$$

We write these constraints in terms of expectations using our
distributions over the various parameters. So far, however, we only know
the expectations for the Gaussian model parameters. Let us proceed by
imputing these known Gaussian model expectations into the constraints,
which simplify as follows:

$$y_t \sum_m Q_{tj}(m) \Big( E\{\log \alpha_m\} + E\{\log N(X_t|\theta_m^+)\} + \tfrac{1}{2} E\{b\} \Big) - y_t \sum_n Q_{tj}(n) \Big( E\{\log \beta_n\} + E\{\log N(X_t|\theta_n^-)\} - \tfrac{1}{2} E\{b\} \Big) - \gamma_t \ge 0 \quad \forall t, \forall j.$$

Simplifying further by isolating the unknown expectations on the left
hand side of the inequality:

$$y_t \sum_m Q_{tj}(m) \Big( E\{\log \alpha_m\} + \tfrac{1}{2} E\{b\} \Big) - y_t \sum_n Q_{tj}(n) \Big( E\{\log \beta_n\} - \tfrac{1}{2} E\{b\} \Big) \ge \gamma_t - y_t \sum_m Q_{tj}(m) E\{\log N(X_t|\theta_m^+)\} + y_t \sum_n Q_{tj}(n) E\{\log N(X_t|\theta_n^-)\}.$$

Next, we define the surrogate variables $a_m$ for $m = 1..M$ and $b_n$
for $n = 1..N$ which capture the effect of the mixing proportions
simultaneously with the bias. We also define and readily compute the
scalar values $d_{tj}$ for each Lagrange multiplier and classification
constraint which capture the right hand side of the above inequalities.
The following are the definitions of our surrogate variables:

$$a_m = E\{\log \alpha_m\} + \tfrac{1}{2} E\{b\} \quad m = 1..M$$

$$b_n = E\{\log \beta_n\} - \tfrac{1}{2} E\{b\} \quad n = 1..N$$

$$d_{tj} = \gamma_t - y_t \sum_m Q_{tj}(m) E\{\log N(X_t|\theta_m^+)\} + y_t \sum_n Q_{tj}(n) E\{\log N(X_t|\theta_n^-)\} \quad \forall t, j.$$

Inserting these definitions into the inequalities gives the
classification constraints compactly as

$$y_t \sum_{m=1}^M Q_{tj}(m)\, a_m - y_t \sum_{n=1}^N Q_{tj}(n)\, b_n \ge d_{tj} \quad \forall t, \forall j.$$

Next we note that a constraint is satisfied as a strict inequality when
its corresponding Lagrange multiplier $\lambda_{tj}$ vanishes (i.e. for a
particular datum $t$ and latent configuration $j$). Furthermore, in the
non-separable setting when the Lagrange multipliers are clamped at $c$,
we note that the constraints cannot be satisfied and we give up on them.
However, for all Lagrange multipliers strictly within 0 and $c$, i.e.
$\lambda_{tj} \in (0, c)$, we should satisfy the classification
constraints with equality:

$$y_t \sum_{m=1}^M Q_{tj}(m)\, a_m - y_t \sum_{n=1}^N Q_{tj}(n)\, b_n = d_{tj} \quad \forall t, \forall j \ \text{when}\ \lambda_{tj} \in (0, c).$$

The above are the by-products of the KKT conditions and we can solve for
the desired $a_m$ and $b_n$ variables from the simple linear system of
equations. Note that all $a_m$ and $b_n$ variables form a vector which is
multiplied by the matrix of $y_t Q$ concatenated with $-y_t Q$. A
pseudo-inverse or singular value decomposition then produces $a_m$ for
$m = 1..M$ and $b_n$ for $n = 1..N$. It is then possible (although not
necessary) to recover the expectations $E\{\log \alpha_m\}$,
$E\{\log \beta_n\}$ and $E\{b\}$ from the values for $a_m$ and $b_n$.
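A minimal sketch of this linear solve, assuming the multipliers and latent distributions are stored with one row per $(t, j)$ pair, is given below. The function name and tolerance are hypothetical, and NumPy's least-squares routine stands in for the pseudo-inverse mentioned above.

```python
import numpy as np

def solve_surrogates(lam, y, Q_pos, Q_neg, d, c, tol=1e-8):
    """Solve y_k (Q_pos a - Q_neg b) = d_k over multipliers inside (0, c)."""
    free = (lam > tol) & (lam < c - tol)
    y_free = y[free][:, None]
    A = np.hstack([y_free * Q_pos[free], -y_free * Q_neg[free]])
    sol, *_ = np.linalg.lstsq(A, d[free], rcond=None)
    M = Q_pos.shape[1]
    return sol[:M], sol[M:]

rng = np.random.default_rng(0)
K, M, N = 8, 2, 3
Q_pos = rng.dirichlet(np.ones(M), size=K)     # rows sum to one
Q_neg = rng.dirichlet(np.ones(N), size=K)
y = np.array([1.0, -1.0, 1.0, -1.0, 1.0, 1.0, -1.0, -1.0])
a_true = np.array([0.3, -0.2])
b_true = np.array([0.1, 0.4, -0.5])
d = y * (Q_pos @ a_true - Q_neg @ b_true)
lam = np.full(K, 0.5)
a_hat, b_hat = solve_surrogates(lam, y, Q_pos, Q_neg, d, c=1.0)
```

Because the rows of both latent matrices sum to one, adding the same constant to every $a_m$ and $b_n$ leaves the equations unchanged; the least-squares routine returns the minimum-norm solution, which need not equal `a_true` and `b_true` but reproduces `d` exactly.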

Therefore, in the non-informative case, we do not explicitly represent
distributions over mixing proportions or bias but only compute
expectations that involve them. That is all that is needed to perform
classification or to update our latent distributions Q and Q. We have
thus effectively completed the first step of our update algorithm and
improved our estimates of the distribution over all the model parameters
given by $P^{i+1}(\Theta)$, or, equivalently, updated the expectations
that use these distributions. Furthermore, when using the decision rules
in Section 4 it is also possible to perform classification directly with
the $a_m$ and $b_n$ values we have recovered instead of directly
representing the mixing proportion and bias distributions. The estimated
model probabilities and expectations are then used to refine the current
setting of the Q, Q distributions and the $\gamma$ by directly using
Theorem 5.2. We have therefore recovered both update rules in the
iterative latent MED formulation and can readily iterate the algorithm
and eventually use the solution for discriminative classification with a
mixture of Gaussians via one of the proposed decision rules.

In Figure 5.4, we see the result of running the above iterative latent
MED formulation for a mixture of 2 Gaussians for the positive data and
2 Gaussians for the negative data following the initial example problem
we posed at the beginning of this chapter. We arbitrarily chose $c = 10$
for the regularization constant. The result shows that the latent MED
solution obtains the optimal decision boundary. This involved
approximately 20 latent MED iterations and sometimes considerably fewer,
depending on the random initialization we used. There were some local
minima, yet these were easily avoided when we annealed the Q and Q
distributions. The figure shows the resulting decision boundaries that
emerge for the two-class dataset and also shows the location of the
Gaussian means. It is interesting to see that the Gaussian means are
close to the origin and not where we originally hypothesized them to be
in the introductory example. This is due to our choice of a white
Gaussian prior on means. Nevertheless, the constraints can be met with
large margins at this configuration and the solution that emerges would
have been very awkward to find without a mixture modeling approach
(standard SVMs with nonlinear kernel methods such as polynomial kernels
and radial basis function kernels would find very different solutions
from the 4 horizontal strips that emerged for this problem).

Figure 5.4. Latent MED Result on a Mixture of Gaussians Discrimination Problem.

5.4 Extension to Kernels 

Throughout the above treatment of the mixture of Gaussians
discrimination problem, all computations involved only dot products
between data points. Therefore, we can readily employ the kernel trick
used in traditional support vector machines by replacing all dot products
with kernel evaluations. This kernelization of the large margin mixture
of Gaussians allows us to also explore mappings to Hilbert space, giving
more flexibility than a mere hyperplane. We replace all inner products by
a kernel evaluation, assuming our data is mapped to Hilbert space via the
function $\Phi(X)$ acting upon an input datum (which no longer needs to
be in Euclidean space):

$$K(X_t, X_{t'}) = \langle \Phi(X_t), \Phi(X_{t'}) \rangle.$$

Since evaluating Gaussian likelihoods only involves inner products be- 
tween the Gaussian’s mean and a datum, the discriminant function is 
also readily kernelizable and yields more elaborate nonlinearities in the 
classifier beyond those emerging from the mixture model alone. Fur- 
thermore, such a kernelization of the latent MED framework does not 
compromise its convergence properties and the two-step iterative algo- 
rithm effectively remains unchanged under any choice of the kernel. 

5.5 Extension to Non Gaussian Mixtures 

In the above treatment of mixtures we focused on Gaussians. We can
extend the iterative latent MED framework to mixtures of non-Gaussian
distributions as long as they remain within the exponential family.
Recall that non-latent MED integrals in Chapter 3 were solvable in closed
form for any exponential family distribution. The latent MED formulation
split our mixture of Gaussians into isolated Gaussian expectations over
log-Gaussian probabilities. In other words, log-sums were replaced by
quantities like:

$$E_{P(\theta_m^+)}\{ Q_t(m) \log P(X_t|\theta_m^+) \}.$$

This makes it straightforward to solve the MED projection when we
replace the $P(X|\theta_m^+)$ and $P(X|\theta_n^-)$ with any exponential
family distribution and its conjugate for $P(\theta_m^+)$ and
$P(\theta_n^-)$, since these types of expectations and the MED
projections are straightforward for the exponential family. This permits
us to compute partition functions, solve the required parameter
distribution update rule, compute the expectations needed to update Q and
Q and so forth for many other mixture distributions. These details are
omitted for other exponential families: the algebraic derivations should
essentially mirror those in the Gaussian case.



6. Efficiency 

At this point, we can take a step back and see if we can further
simplify the $J(\lambda)$ equation and other updates in our iterative
latent MED approach. This is particularly important, since optimization
of latent MED involves more than just $T$ Lagrange multipliers but
potentially $N T_p + M T_n$ Lagrange multipliers. This may be inefficient
for certain latent models. We show that it is not necessary to consider
the joint optimization over all Lagrange multipliers since many of them
will be highly constrained if we only update them in a sequential manner.
We shall stick to the mixture of Gaussians formulation here but it is
straightforward to apply this efficiency result to latent graphical
models in general.

Consider only updating the Lagrange multipliers $\lambda_{uj}$,
$j = 1..N$ and $\lambda_{vj}$, $j = 1..M$ where $y_u = +1$ and
$y_v = -1$. This is similar to the sequential minimal optimization (SMO)
approach of Platt [147], where a simple algorithm for updating two
Lagrange multipliers corresponding to differently labeled data points is
used iteratively to avoid directly solving the SVM's large quadratic
program. In our case, however, the $u$ and $v$ corresponding to a
positive datum and a negative datum each select a set of Lagrange
multipliers, since the latent variables introduce extra MED
classification constraints. It will then become clear that the entries
across $j = 1..N$ in $\lambda_{uj}$ and across $j = 1..M$ in
$\lambda_{vj}$ are not independent but highly constrained. In fact
$\lambda_{uj}$ will emerge by a scaling of the latent distribution over
hidden variables given by $q_v$ and $\lambda_{vj}$ will involve scaling
$q_u$. Therefore, we do not need to explore the optimization over all
Lagrange multipliers corresponding to all hidden states. This alleviates
the larger SVM optimization problem we obtained when we had to induce
additional virtual constraints in the latent MED formulation. In a
sequential optimization, the additional Lagrange multipliers can be
updated analytically across the latent configurations instead of
requiring a large brute-force quadratic program. A similar strategy was
also leveraged by [1] in a support vector machine framework with
intractable numbers of Lagrange multipliers. We now explicate how to
avoid considering the full joint optimization over all Lagrange
multipliers. Furthermore, we will describe how the many Lagrange
multipliers in the latent MED problem can be stored efficiently.

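Recall the definitions of $q_t(m)$ and $q_t(n)$ from Equation 5.2:

$$q_t(m) = \frac{\exp\left( \int P^i(\Theta) \log P(m, X_t|\Theta_+)\, d\Theta \right)}{\sum_m \exp\left( \int P^i(\Theta) \log P(m, X_t|\Theta_+)\, d\Theta \right)}$$

$$q_t(n) = \frac{\exp\left( \int P^i(\Theta) \log P(n, X_t|\Theta_-)\, d\Theta \right)}{\sum_n \exp\left( \int P^i(\Theta) \log P(n, X_t|\Theta_-)\, d\Theta \right)}.$$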



Assume we have an objective function $J(\lambda)$ arising from any
mixture model which we want to maximize subject to constraints. This
entails the following constraints on the Lagrange multipliers in addition
to non-negativity:

$$\sum_{tj} \lambda_{tj} y_t Q_{tj}(m) = 0 \quad m = 1..M$$

$$\sum_{tj} \lambda_{tj} y_t Q_{tj}(n) = 0 \quad n = 1..N.$$

We rewrite the constraints above by recalling the definition of Q and Q
and simplifying. The first set of constraints yields:

$$\begin{aligned}
\sum_{t \in T_p} \sum_j \lambda_{tj} Q_{tj}(m) &= \sum_{t \in T_n} \sum_j \lambda_{tj} Q_{tj}(m) \\
\sum_{t \in T_p} \sum_j \lambda_{tj}\, q_t(m) &= \sum_{t \in T_n} \sum_j \lambda_{tj}\, \delta(m = j) \\
\sum_{t \in T_p} \lambda_t^T 1\, q_t &= \sum_{t \in T_n} \lambda_t.
\end{aligned}$$

The second set of constraints yields:

$$\sum_{t \in T_p} \lambda_t = \sum_{t \in T_n} \lambda_t^T 1\, q_t.$$

Here we have introduced the notation $\lambda_t$ to denote a vector of
Lagrange multipliers. These are all the Lagrange multipliers arising from
the $t$'th data point as we consider all latent configurations to enforce
the latent MED formulation's additional constraints. They are grouped
into a vector $\lambda_t$ containing all the Lagrange multiplier
configurations as we vary the $j$ index. In other words, the vectors are
concatenations of the following scalar values:

$$\lambda_t = [\lambda_{t1} \ldots \lambda_{tj} \ldots \lambda_{tN}]^T \quad t \in T_p$$

$$\lambda_t = [\lambda_{t1} \ldots \lambda_{tj} \ldots \lambda_{tM}]^T \quad t \in T_n.$$

Next, assume that we lock all Lagrange multipliers or axes in the
optimization except for $\lambda_{uj}\ \forall j$ and
$\lambda_{vj}\ \forall j$ where $y_u = +1$ and $y_v = -1$. These
correspond to the vectors of Lagrange multipliers $\lambda_u$ and
$\lambda_v$, respectively. For short, also define the set of points
$\tilde{T}_p$ as all positive data points $T_p$ without $u$ and
$\tilde{T}_n$ as all negative data points $T_n$ without $v$. Let us write
the constraints in terms of these two subsets of the Lagrange multipliers
assuming all others are fixed:

$$\sum_{t \in \tilde{T}_p} \lambda_t^T 1\, q_t + \lambda_u^T 1\, q_u = \sum_{t \in \tilde{T}_n} \lambda_t + \lambda_v.$$

Isolating $\lambda_v$, we immediately see the following important
relationship:

$$\lambda_v = (\lambda_u^T 1)\, q_u + \Big[ \sum_{t \in \tilde{T}_p} \lambda_t^T 1\, q_t - \sum_{t \in \tilde{T}_n} \lambda_t \Big] = (\lambda_u^T 1)\, q_u + R_{uv}.$$

It is clear that the update for $\lambda_v$ can only move along one
degree of freedom, basically as a scaling of the current distribution of
latent variables implied by $q_u$ plus the constant term $R_{uv}$ from
the square brackets. The amount of scaling is given by $\lambda_u^T 1$.
Similarly, we have the following constraint for $\lambda_u$:

$$\lambda_u = (\lambda_v^T 1)\, q_v + \Big[ \sum_{t \in \tilde{T}_n} \lambda_t^T 1\, q_t - \sum_{t \in \tilde{T}_p} \lambda_t \Big] = (\lambda_v^T 1)\, q_v + R_{vu}.$$



Once again, this vector of Lagrange multipliers is only a scaling of the
latent variables implied by $q_v$ plus some fixed quantity vector given
by the constant term $R_{vu}$. Multiplying either of the above vector
constraints on both sides by a vector of ones yields the scalar
constraint

$$\lambda_u^T 1 + \sum_{t \in \tilde{T}_p} \lambda_t^T 1 = \lambda_v^T 1 + \sum_{t \in \tilde{T}_n} \lambda_t^T 1.$$



This indicates that we can optimize only over two linearly coupled scalar
quantities to update $\lambda_u$ and $\lambda_v$ and increase
$J(\lambda)$ incrementally. Let us define the two scalar quantities of
interest as follows: $u = \lambda_u^T 1$ and $v = \lambda_v^T 1$. The
constraint coupling the two scalar variables is then:

$$u + \sum_{t \in \tilde{T}_p} \lambda_t^T 1 = v + \sum_{t \in \tilde{T}_n} \lambda_t^T 1.$$



Modifying $u$ and $v$ in this way ensures that we will never violate the
equality constraints and effectively limits us to one true degree of
freedom, as in SMO [147]. We only need to modify the value of $u$ since
$v$ is related to it by the above formula. Of course, we still need to
ensure that none of the Lagrange multipliers in the original $\lambda_u$
and $\lambda_v$ violate the box constraints $[0, c]$ as we vary $u$ and
$v$. These box constraints will translate to constraints directly on the
two scalars, clamping the effective range that $u$ and $v$ can explore.
In fact, we can distill the box constraints to just a range of allowed
$u$ values since only this one degree of freedom is left in this
iterative optimization scheme.

Note the box constraints interacting with $u$:

$$0 \le u\, q_u + \sum_{t \in \tilde{T}_p} \lambda_t^T 1\, q_t - \sum_{t \in \tilde{T}_n} \lambda_t \le c.$$

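We can manipulate the above to find the following succinct constraints
on $u$, which we enumerate with the index $m$ to explore all entries of
the many Lagrange multipliers and latent variable configurations:

$$u \ge u_{\min} = \max_m \left\{ \frac{\sum_{t \in \tilde{T}_n} \lambda_t(m)}{q_u(m)} - \frac{\sum_{t \in \tilde{T}_p} \lambda_t^T 1\, q_t(m)}{q_u(m)} \right\}$$

$$u \le u_{\max} = \min_m \left\{ \frac{c}{q_u(m)} + \frac{\sum_{t \in \tilde{T}_n} \lambda_t(m)}{q_u(m)} - \frac{\sum_{t \in \tilde{T}_p} \lambda_t^T 1\, q_t(m)}{q_u(m)} \right\}.$$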

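We also need to consider the constraints on $v$ and propagate their
impact on $u$. These give:

$$0 \le v\, q_v + \sum_{t \in \tilde{T}_n} \lambda_t^T 1\, q_t - \sum_{t \in \tilde{T}_p} \lambda_t \le c,$$

which is again manipulated as follows:

$$v \ge v_{\min} = \max_m \left\{ \frac{\sum_{t \in \tilde{T}_p} \lambda_t(m)}{q_v(m)} - \frac{\sum_{t \in \tilde{T}_n} \lambda_t^T 1\, q_t(m)}{q_v(m)} \right\}$$

$$v \le v_{\max} = \min_m \left\{ \frac{c}{q_v(m)} + \frac{\sum_{t \in \tilde{T}_p} \lambda_t(m)}{q_v(m)} - \frac{\sum_{t \in \tilde{T}_n} \lambda_t^T 1\, q_t(m)}{q_v(m)} \right\}.$$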



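Ultimately, $u$ is constrained to the following scalar interval from the
intersection of all the different constraints on the individual Lagrange
multipliers:

$$u \ge \max\Big\{ u_{\min},\ v_{\min} - \sum_{t \in \tilde{T}_p} \lambda_t^T 1 + \sum_{t \in \tilde{T}_n} \lambda_t^T 1 \Big\}$$

$$u \le \min\Big\{ u_{\max},\ v_{\max} - \sum_{t \in \tilde{T}_p} \lambda_t^T 1 + \sum_{t \in \tilde{T}_n} \lambda_t^T 1 \Big\}.$$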
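The interval for $u$ can be computed directly from the two box constraints. The sketch below assumes strictly positive latent probabilities and takes the fixed offset vectors $R_{uv}$ and $R_{vu}$ as given; all function and argument names are hypothetical.

```python
import numpy as np

def scalar_interval(q, R, c):
    """Range of s keeping every entry of s*q + R inside [0, c] (q > 0)."""
    lo = np.max((0.0 - R) / q)
    hi = np.min((c - R) / q)
    return lo, hi

def feasible_u(q_u, R_uv, q_v, R_vu, sum_pos, sum_neg, c):
    """Intersect the box constraints on lambda_v and lambda_u in terms of u.

    sum_pos: sum of lambda_t^T 1 over positive points other than u
    sum_neg: sum of lambda_t^T 1 over negative points other than v
    """
    u_min, u_max = scalar_interval(q_u, R_uv, c)   # lambda_v = u*q_u + R_uv
    v_min, v_max = scalar_interval(q_v, R_vu, c)   # lambda_u = v*q_v + R_vu
    shift = sum_neg - sum_pos                      # u = v + shift
    return max(u_min, v_min + shift), min(u_max, v_max + shift)

q_u = np.array([0.5, 0.5])
q_v = np.array([0.25, 0.75])
zero = np.zeros(2)
lo, hi = feasible_u(q_u, zero, q_v, zero, sum_pos=0.0, sum_neg=0.0, c=1.0)
```

The lower end is the largest of the lower bounds and the upper end is the smallest of the upper bounds, so the result is empty whenever `lo > hi`.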



We can now write the J(A) objective function for a mixture model in 
terms of only the single scalar value fi, and update it by itself. This 
then identifies the update for fi, and ultimately yields the overall update 
for X v and X u . Maximizing the single variable u can be done in closed 
form if the resulting function is a quadratic, or by bisection search while 
maintaining the above constraints on u. This efficient optimization of the 
Lagrange multipliers holds for any mixture model, since it emerges from 
the constraints on the Lagrange multipliers that arise for all mixtures. 
We next explicate the sequential update rule specifically for mixtures of 
Gaussians. 

6.1 Efficient Mixtures of Gaussians 

In this section, we elaborate the sequential update rule in more detail 
for the mixture of Gaussians case where the objective function reduces 
to a single scalar update of u. Recall the objective function over all 
Lagrange multipliers and write it in the following vector and matrix 
form: 



JW = Y%Vt - lYsVtytNQtAt'XjXt' . 

t z t,t> 

It is now appropriate to look more closely at the matrices Q that the 
$\lambda_t$ vectors interact with. Four different cases need to be enumerated 
depending on the configuration of the binary labels $y_t$ and $y_{t'}$: 

$$(y_t, y_{t'}) \in \{(+1, +1), (-1, -1), (-1, +1), (+1, -1)\} .$$

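Let us see in detail what one of the $Q_{t,t'}$ matrices looks like, for instance, 
when both labels are positive: 

$$\begin{aligned}
Q_{t \in T_p, t' \in T_p}(j, j') &= \sum_m Q_{tj}(m)\, Q_{t'j'}(m) + \sum_n Q_{tj}(n)\, Q_{t'j'}(n) \\
&= q_t^T q_{t'} + \sum_n \delta(j = n)\, \delta(j' = n) \\
&= q_t^T q_{t'} + \delta(j = j') .
\end{aligned}$$

If instead t and t' had different labels, we would obtain: 

$$\begin{aligned}
Q_{t \in T_n, t' \in T_p}(j, j') &= \sum_m Q_{tj}(m)\, Q_{t'j'}(m) + \sum_n Q_{tj}(n)\, Q_{t'j'}(n) \\
&= \sum_m \delta(j = m)\, q_{t'}(m) + \sum_n q_t(n)\, \delta(j' = n) \\
&= q_{t'}(j) + q_t(j') .
\end{aligned}$$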

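Omitting the rest of the required algebra, we now enumerate the possible 
matrices for $Q_{t,t'}$ and note their dimensions: 

$$\begin{aligned}
Q_{t \in T_p, t' \in T_p} &= (q_t^T q_{t'} + 1)\, I \in \mathbb{R}^{N \times N} \\
Q_{t \in T_n, t' \in T_n} &= (q_t^T q_{t'} + 1)\, I \in \mathbb{R}^{M \times M} \\
Q_{t \in T_n, t' \in T_p} &= \mathbf{1} q_t^T + q_{t'} \mathbf{1}^T \in \mathbb{R}^{M \times N} \\
Q_{t \in T_p, t' \in T_n} &= \mathbf{1} q_t^T + q_{t'} \mathbf{1}^T \in \mathbb{R}^{N \times M}
\end{aligned}$$
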
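These allow us to rewrite our maximization problem over $J(\lambda)$ as 

$$\begin{aligned}
& -\tfrac{1}{2} \sum_{t \in T_p} \sum_{t' \in T_p} (q_t^T q_{t'} + 1)\, \lambda_t^T \lambda_{t'}\, X_t^T X_{t'} \\
& -\tfrac{1}{2} \sum_{t \in T_n} \sum_{t' \in T_n} (q_t^T q_{t'} + 1)\, \lambda_t^T \lambda_{t'}\, X_t^T X_{t'} \\
& +\tfrac{1}{2} \sum_{t \in T_n} \sum_{t' \in T_p} \left( \lambda_t^T \mathbf{1} q_t^T \lambda_{t'} + \lambda_t^T q_{t'} \mathbf{1}^T \lambda_{t'} \right) X_t^T X_{t'} \\
& +\tfrac{1}{2} \sum_{t \in T_p} \sum_{t' \in T_n} \left( \lambda_t^T \mathbf{1} q_t^T \lambda_{t'} + \lambda_t^T q_{t'} \mathbf{1}^T \lambda_{t'} \right) X_t^T X_{t'} .
\end{aligned}$$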

The above can then be written in terms of the two Lagrange multiplier 
vectors we are updating in this iteration, $\lambda_u$ and $\lambda_v$. Manipulating this 
further, we would ultimately recover a quadratic function over the single 
scalar u by itself, which is solvable in closed form by taking the derivative 
with respect to u and setting it to 0. The algebra is omitted but is 
straightforward. If the analytic update of u puts us outside of the range of the 
inequality constraints on u, we merely clamp the value of u such that it 
is pulled back into the valid range. This process is iterated as we 
randomly select a pair of Lagrange multiplier vectors $\lambda_u$ and $\lambda_v$ to update at a time 
until the overall objective function stops increasing. We then switch to 
updating Q and Q as in Theorem 5.2, interleaving it with sequential 
updates of Theorem 5.1 to eventually converge the overall latent MED 
formulation. 
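
The closed-form step followed by clamping can be sketched as follows, assuming the reduced objective has the form a u^2 + b u + const with a < 0. The coefficients a and b would come from the omitted algebra; here they are placeholders, and the function name is hypothetical.

```python
def clamped_quadratic_argmax(a, b, lo, hi):
    """Maximize a*u**2 + b*u (+ const) subject to lo <= u <= hi.

    Assumes a < 0 so that the quadratic is strictly concave. The derivative
    2*a*u + b vanishes at u = -b / (2*a); if that point violates the interval
    constraints, the concave objective is maximized at the nearest endpoint.
    """
    if a >= 0.0:
        raise ValueError("expected a strictly concave quadratic (a < 0)")
    if lo > hi:
        raise ValueError("empty feasible interval")
    u_star = -b / (2.0 * a)
    return min(max(u_star, lo), hi)
```

Clamping is exact here only because a concave function of one variable decreases monotonically as we move away from its unconstrained maximizer.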

7. Structured Latent Models 

While the above sequential efficient strategy is helpful in the regular 
flat mixture case, it is even more crucial in models where there is a much 
larger configuration space of hidden variables. For instance, recall the 
hidden Markov model from Chapter 2. This model can be seen as a 
mixture model with an exponential number of states corresponding to 
all possible paths through its state trellis. In such cases the efficient 
computation approach in the previous section is indispensable. 
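
To make the mixture view concrete, the following sketch compares a brute-force sum over all M^U state paths with the standard forward recursion on a toy discrete-output HMM. All parameter values and function names are made up for illustration and are not taken from the text.

```python
from itertools import product


def path_joint(path, obs, init, trans, emit):
    """Joint probability P(m, X) of one state path and the observations."""
    p = init[path[0]] * emit[path[0]][obs[0]]
    for i in range(1, len(obs)):
        p *= trans[path[i - 1]][path[i]] * emit[path[i]][obs[i]]
    return p


def likelihood_as_mixture(obs, init, trans, emit):
    """Sum over all M**U paths: the HMM treated as a flat mixture."""
    states = range(len(init))
    return sum(path_joint(path, obs, init, trans, emit)
               for path in product(states, repeat=len(obs)))


def likelihood_forward(obs, init, trans, emit):
    """Same quantity via the forward recursion, in O(U * M**2) time."""
    m = len(init)
    alpha = [init[k] * emit[k][obs[0]] for k in range(m)]
    for x in obs[1:]:
        alpha = [sum(alpha[j] * trans[j][k] for j in range(m)) * emit[k][x]
                 for k in range(m)]
    return sum(alpha)


# Toy parameters (hypothetical): M = 2 states, 2 output symbols, U = 4.
INIT = [0.6, 0.4]
TRANS = [[0.7, 0.3], [0.2, 0.8]]
EMIT = [[0.9, 0.1], [0.3, 0.7]]
OBS = [0, 1, 1, 0]
```

The brute-force version enumerates 2^4 = 16 paths here, while the recursion touches each pair of consecutive states once per time step, which is what keeps structured models tractable.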

Similarly, in the case of general Bayesian networks, we may again have 
latent variables that create many configurations in the latent problem. 
Therefore, we would again introduce additional Lagrange multipliers 
that constrain the Jensen-bounded terms for the correct model in the 
discriminant function to be larger than each possible setting of the latent 
variables in the incorrect model. It is easy to see that in these cases a 
brute force approach would create a huge dual space optimization problem 
over an intractable number of Lagrange multipliers $\lambda$. However, this 
need not be an intractable problem, and we discuss various efficiencies 
in general latent graphical models that maintain a tractable latent MED 
formulation. 

One key intuition is that we do not need to store or update all the 
Lagrange multipliers exhaustively. If we assume non-informative priors, 
these many Lagrange multipliers can only be updated by scalings of 
the corresponding posterior distributions on latent variables as shown 
in Section 6. The Lagrange multipliers will therefore inherit the fac- 
torization properties of the latent variable posterior. This makes their 
optimization highly constrained, and also makes storing and manipulat- 
ing them efficient using graphical modeling tools, such as junction tree 
algorithms, conditional probability tables, and so on. 

In general graphical models, the latent distributions should no longer 
be described as a flat mixture over components with generic tables P(m) 
or P(n). We instead assume we have highly structured factorized latent 
distributions for the positive and for the negative models. These latent 
variables and their distributions could be characterized by a directed 
graphical model implying the following highly structured factorizations: 

$$P(m) = \prod_{i=1}^{U} P(m_i \mid m_{\pi_i}) \qquad P(n) = \prod_{i=1}^{V} P(n_i \mid n_{\pi_i}) .$$

Thus, each latent configuration m is really composed of U latent variables 
$m = \{m_1, \ldots, m_U\}$. Similarly, the latent distribution P(n) has 
a configuration space n that is really composed of V different latent 
variables $n = \{n_1, \ldots, n_V\}$. These distributions are compactly specified 
in their conditional forms, where each node $m_i$ (or $n_i$) is conditionally 
independent from the others given its parents $m_{\pi_i}$ (or $n_{\pi_i}$) in the graphical 
model. In addition, the observed data variables X will also only 
depend on subsets of the latent variables, instead of all of them as was 
the case for mixture models (where we had generic P(X|m) or P(X|n) 
dependencies). For instance, recall the hidden Markov model depicted 
in Figure 2.4, where we had a structured distribution over discrete latent 
variables m of cardinality M and the observation sequence $X_t$ of length 
U: 



$$P(m, X_t \mid \Theta_+) = P(m_1)\, P(X_{t,1} \mid m_1) \prod_{i=2}^{U} P(X_{t,i} \mid m_i)\, P(m_i \mid m_{i-1}) .$$

Here we have omitted the dependence on $\Theta_+$ for compactness and changed 
notation to avoid conflicting symbols. In a discriminative classification 
scenario for classifying two classes of strings or sequences, the above 
hidden Markov model can be used to represent positive data, while another 
hidden Markov model $P(n, X_t \mid \Theta_-)$ represents negative data. Here 
n contains a total of V hidden discrete variables of cardinality N as 
$n = \{n_1, \ldots, n_V\}$ which are also endowed with a graph structure (in 
this case a Markov chain). We then consider using the log-likelihood ratio 
of such hidden Markov models in the standard discriminant function 



$$\mathcal{L}(X_t; \Theta) = \log \frac{\sum_m P(m, X_t \mid \Theta_+)}{\sum_n P(n, X_t \mid \Theta_-)} + b .$$



Since $X_t$ will always be observed, let us focus only on the factorized or 
structured distributions over discrete latent variables P(m) and P(n). 
Denote the actual scalar entries for these factorized probability distributions 
by multinomial tables of parameters $\alpha_{m_i \mid m_{\pi_i}}$ and $\beta_{n_i \mid n_{\pi_i}}$ respectively. 
These are essentially standard multinomial distributions when we condition 
on the parent variables. 

We now follow an approach that essentially mirrors the previous flat 
mixture of Gaussians case at the beginning of the chapter. The flat 
mixture re-emerges as a special case if we assume only a single hidden variable 
per class and take the parents of the distributions to be the null set, 
$\pi_i = \{\}$. In the HMM case, we have $m_{\pi_i} = m_{i-1}$ and $n_{\pi_i} = n_{i-1}$. 

Consider computing the latent MED projection in Theorem 5.1 for $P(\Theta)$. 
For each table $\alpha_m$ and $\beta_n$ and each configuration of the parents of 
the multinomial distribution, we introduce conjugate Dirichlet priors 



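$$P^0(\alpha_{m_{\pi_i}}) = \frac{\Gamma\left(\sum_{m_i} \phi_{m_i}\right)}{\prod_{m_i} \Gamma(\phi_{m_i})} \prod_{m_i} \alpha_{m_i \mid m_{\pi_i}}^{\phi_{m_i} - 1}$$

$$P^0(\beta_{n_{\pi_i}}) = \frac{\Gamma\left(\sum_{n_i} \psi_{n_i}\right)}{\prod_{n_i} \Gamma(\psi_{n_i})} \prod_{n_i} \beta_{n_i \mid n_{\pi_i}}^{\psi_{n_i} - 1} .$$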



For a stationary hidden Markov model, $P(m_i \mid m_{i-1})$ and $P(n_i \mid n_{i-1})$ stay 
constant for any choice of i. There we would only have a total of M 
and N of these Dirichlet prior distributions to consider in MED and to 
refine into estimated posteriors. There are also the emission distributions, 
which are straightforward to handle and will not be elaborated here. The 
latent MED formulation then directly applies using the updates from 
Theorem 5.1 and Theorem 5.2. We can compute the standard vectors 
containing posterior distributions over latent variables stored in $q_t(m)$ 
and $q_t(n)$. One important property of these posterior distributions is 
that they inherit the factorization on P(m) and P(n) as in the EM 
framework. We then have the following factorizations: 

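$$q_t(m_1, \ldots, m_U) = \prod_{i=1}^{U} q_t(m_i \mid m_{\pi_i}) = \prod_{i=1}^{U} q_t(m_i \mid m_{i-1})$$

$$q_t(n_1, \ldots, n_V) = \prod_{i=1}^{V} q_t(n_i \mid n_{\pi_i}) = \prod_{i=1}^{V} q_t(n_i \mid n_{i-1}) .$$
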
We also have the corresponding latent distributions for the repeated 
constraints over all latent configurations which we again denote by Q 
and Q. These are given as before but are indexed by a much larger set 
of configurations for m and n. Denote the space of configurations of the 
m variable as $\mathcal{M}$ and the space of configurations of the n variable as 
$\mathcal{N}$. We then define the posterior distributions on latent variables as 



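$$Q_{tj}(m) = \begin{cases} q_t(m) & \forall t \in T_p,\; j = 1..|\mathcal{N}| \\ \delta(m = j) & \forall t \in T_n,\; j = 1..|\mathcal{M}| \end{cases}$$

$$Q_{tj}(n) = \begin{cases} \delta(n = j) & \forall t \in T_p,\; j = 1..|\mathcal{N}| \\ q_t(n) & \forall t \in T_n,\; j = 1..|\mathcal{M}| \end{cases}$$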



where both Q and Q would also inherit the structured factorization 
properties of the original latent graphical model. Similarly, for every 
datum we have a set of Lagrange multipliers given by $\lambda_{tj}$ for $j = 1..|\mathcal{N}|$ 
when $y_t = +1$ or $j = 1..|\mathcal{M}|$ when $y_t = -1$. These correspond to the constraints 
ensuring that the Jensen bound on the correct model overtakes 
the incorrect model for all settings of its latent configurations. 

Subsequently, if we assume non-informative priors on the bias and on 
all our Dirichlet distributions by letting $\phi_{m_i} \to 0$ and $\psi_{n_i} \to 0$ for all 
$m_i$ and $n_i$, we again notice a remarkable simplification. The objective 
function in the MED framework (or equivalently the partition function) 
diverges unless the following conditions hold: 

$$\sum_{tj} \lambda_{tj}\, y_t\, Q_{tj}(m_i \mid \pi_{m_i}) = 0 \quad \forall\, m_i, \pi_{m_i}$$

$$\sum_{tj} \lambda_{tj}\, y_t\, Q_{tj}(n_i \mid \pi_{n_i}) = 0 \quad \forall\, n_i, \pi_{n_i} .$$



In the particular case of the stationary hidden Markov model, the parameter 
distributions are shared across many configurations, and the 
following constraints on the Lagrange multipliers emerge: 

$$\sum_{tj} \lambda_{tj}\, y_t \sum_i Q_{tj}(m_i = k \mid m_{i-1} = k') = 0 \qquad k = 1..M,\; k' = 1..M$$

$$\sum_{tj} \lambda_{tj}\, y_t \sum_i Q_{tj}(n_i = k \mid n_{i-1} = k') = 0 \qquad k = 1..N,\; k' = 1..N .$$

Manipulating this further and denoting the set of positive inputs as $T_p$ and 
the set of negative inputs as $T_n$ gives 



y A tjQtjim - k\mi-i = k ') = 

teT p ,j,i 

y, = k\m,i-i = k ') = 

t<=T p ,j,i 



= 



y - k\mi-i = k!) 

teTn,j,i 

y y^tjS{m=j) 

te.T ni j,i j 

£ x < 

te.T n 



where we defined the distribution $Q_t$ as being the stationary Markov 
chain corresponding to the non-stationary posterior distribution $q_t$ as 

$$Q_t(m) = \prod_{i=1}^{U} Q_t(m_i \mid m_{i-1}) \quad \text{where} \quad Q_t(m_i \mid m_{i-1}) \propto \sum_{i=1}^{U} q_t(m_i \mid m_{i-1}) .$$
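
One literal reading of the proportionality above is to sum the position-specific conditional tables of the posterior and renormalize each row. The sketch below assumes this reading, and assumes each table is stored as a nested list indexed as table[previous][current]; both are assumptions made for illustration, and the function name is hypothetical.

```python
def stationary_transition(transitions):
    """Pool non-stationary conditional tables into one stationary table.

    `transitions` is a list of tables, one per chain position, where
    table[j][k] holds q(m_i = k | m_{i-1} = j). The pooled table is the
    row-normalized sum of the position-specific tables.
    """
    n_prev = len(transitions[0])
    n_next = len(transitions[0][0])
    summed = [[sum(table[j][k] for table in transitions)
               for k in range(n_next)]
              for j in range(n_prev)]
    return [[value / sum(row) for value in row] for row in summed]
```

Since every row of every input table already sums to one, the normalization simply divides by the number of positions, so the pooled table is the average of the position-specific tables.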



Similarly, we obtain a set of constraints for the negative models: 



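$$\sum_{t \in T_p} \lambda_t = \sum_{t \in T_n} \lambda_t^T \mathbf{1}\, Q_t ,$$

where we defined $Q_t$ as the stationary Markov chain corresponding to 
the non-stationary posterior distribution $q_t$ as 

$$Q_t(n) = \prod_{i=1}^{V} Q_t(n_i \mid n_{i-1}) \quad \text{where} \quad Q_t(n_i \mid n_{i-1}) \propto \sum_{i=1}^{V} q_t(n_i \mid n_{i-1}) .$$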

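If we consider using sequential optimization, as in the efficient mixture 
of Gaussians case, we note the re-emergence of the following highly 
constraining update rules on pairs of Lagrange multiplier sets $\lambda_u$ and $\lambda_v$ 
when $y_u = +1$ and $y_v = -1$. As in the previous section, we have: 

$$\lambda_v = (\lambda_u^T \mathbf{1})\, Q_u + \left[ \sum_{t \in T_p} \lambda_t^T \mathbf{1}\, Q_t - \sum_{t \in T_n} \lambda_t \right]$$

$$\lambda_u = (\lambda_v^T \mathbf{1})\, Q_v + \left[ \sum_{t \in T_n} \lambda_t^T \mathbf{1}\, Q_t - \sum_{t \in T_p} \lambda_t \right] .$$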



Once again, we note that the vector of Lagrange multipliers corresponding 
to each datum, $\lambda_t$, has a highly constrained structure and cannot be 
updated arbitrarily. In fact, the vectors of Lagrange multipliers are best 
thought of as scaled distributions that inherit the factorized structure of 
the posterior on latent variables. 

8. Factorization of Lagrange Multipliers 

In this section, we note that the MED Lagrange multipliers can be 
expressed as a linear combination of the factorized latent distributions 
and thus inherit factorization properties themselves. This permits us 
to efficiently store them as scalings that multiply posterior distributions 
over hidden variables. For instance, recall the efficient sequential update 
method above. Therein, the Lagrange multipliers emerged as scalings of 
the various latent vectors or posterior distributions such as $q_u$ and $q_v$ 
or, for stationary HMMs, $Q_u$ and $Q_v$. In the case of latent graphical 
models, the latent posteriors $q_u$ and $q_v$ factorize according to a graph 
allowing us to efficiently evaluate, store and manipulate these quantities. 
Next, consider each scalar Lagrange multiplier vector entry $\lambda_u(n)$ of the 
vector $\lambda_u$ for a given datum $u \in T_p$. From our update rule, we can simply 
write the Lagrange multiplier as a linear combination of the factorized 
posterior distributions $q_t(n)$ on latent variables for the negative class 
data as follows: 



$$\lambda_u(n_1, \ldots, n_V) = \sum_{t \in T_n} l_{ut} \prod_{i=1}^{V} q_t(n_i \mid n_{\pi_i})$$

where the scalars $l_{ut}$ for $t \in T_n$ are variables that determine the scaling 
on the posteriors. The scalar Lagrange multiplier vector entry $\lambda_v(m)$ of 
a vector $\lambda_v$ for $v \in T_n$ can similarly be written as a linear combination 
of the factorized posterior distributions $q_t(m)$ on latent variables for the 
positive class data as follows: 

$$\lambda_v(m_1, \ldots, m_U) = \sum_{t \in T_p} l_{vt} \prod_{i=1}^{U} q_t(m_i \mid m_{\pi_i})$$

where the scalars $l_{vt}$ for $t \in T_p$ are variables that determine the scaling 
on the posteriors. Thus, we can evaluate, store and update the Lagrange 
multipliers for exponentially large configurations of latent variables 
by merely representing linear combinations of the highly structured 
Bayesian network posterior distributions. It should be noted that the 
Lagrange multipliers need not integrate to unity but should be non-negative. 
Other structures like $R_{uv}$ or $R_{vu}$ can also be efficiently manipulated 
using this method. In fact, we anticipate such graphical model 
constraints on Lagrange multipliers in dual optimization may have other 
promising applications in general. 
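
The storage idea can be sketched for the Markov chain case. Here a multiplier vector over all configurations is never materialized; only the scalings and the chain tables are kept. The data layout (an initial distribution plus one transition table per later position) and all names and numbers are assumptions chosen for illustration.

```python
from itertools import product


def chain_prob(chain, config):
    """Probability of one configuration under a (non-stationary) Markov chain."""
    init, transitions = chain
    p = init[config[0]]
    for i, table in enumerate(transitions, start=1):
        p *= table[config[i - 1]][config[i]]
    return p


def multiplier_entry(scales, chains, config):
    """One entry of the multiplier vector: sum_t l_t * q_t(config)."""
    return sum(l * chain_prob(ch, config) for l, ch in zip(scales, chains))


def multiplier_total(scales):
    """Sum of all entries; each chain is normalized, so only the scalings remain."""
    return sum(scales)


def brute_force_total(scales, chains, n_states, length):
    """Check: sum the entries over every configuration explicitly."""
    return sum(multiplier_entry(scales, chains, config)
               for config in product(range(n_states), repeat=length))


# Toy posteriors over 3 binary latent variables (hypothetical values).
CHAIN_A = ([0.5, 0.5], [[[0.9, 0.1], [0.2, 0.8]], [[0.6, 0.4], [0.3, 0.7]]])
CHAIN_B = ([0.1, 0.9], [[[0.5, 0.5], [0.5, 0.5]], [[1.0, 0.0], [0.25, 0.75]]])
SCALES = [0.3, 1.2]
```

With this layout the total of the multiplier vector is 1.5 regardless of chain length, and the entries are non-negative whenever the scalings are, without being normalized.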

There remain some efficiency issues for structured models. Clearly, 
computing Q and Q remains efficient as in any EM algorithm. By ef- 
ficiency, we mean that all operations in the latent MED formulation 
are of comparable complexity to the E-step in a latent graphical model 
EM algorithm. Meanwhile, the entropy terms H(qt) and H(qt) are 
also efficient if the distributions are factorized. For example, computing 
the entropy of a Markov chain is straightforward. This is because the 
summations distribute over the log-conditionals as 

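$$\begin{aligned}
H(q) &= -\sum_m \prod_{i=1}^{U} q(m_i \mid \pi_{m_i}) \log \prod_{i=1}^{U} q(m_i \mid \pi_{m_i}) \\
&= -\sum_{i=1}^{U} \sum_{m_i, \pi_{m_i}} q(m_i, \pi_{m_i}) \log q(m_i \mid \pi_{m_i}) .
\end{aligned}$$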
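
For a Markov chain the decomposition above needs only the marginal of each parent, which a forward pass provides. The sketch below assumes a chain stored as an initial distribution plus one table per later position (indexed as table[previous][current]) and checks the result against explicit enumeration; names and layout are illustrative assumptions.

```python
import math
from itertools import product


def xlogx(p):
    """p * log(p) with the usual convention that 0 * log(0) = 0."""
    return p * math.log(p) if p > 0.0 else 0.0


def chain_entropy(init, transitions):
    """Entropy of a Markov chain as a sum of local conditional entropies."""
    entropy = -sum(xlogx(p) for p in init)
    marginal = list(init)
    for table in transitions:
        # Add sum_j q(m_{i-1} = j) * H(q(m_i | m_{i-1} = j)).
        for j, pj in enumerate(marginal):
            entropy -= pj * sum(xlogx(p) for p in table[j])
        # Propagate the marginal to the next position.
        marginal = [sum(marginal[j] * table[j][k] for j in range(len(marginal)))
                    for k in range(len(table[0]))]
    return entropy


def chain_entropy_brute_force(init, transitions):
    """Entropy by enumerating every configuration (exponential cost)."""
    total = 0.0
    for config in product(range(len(init)), repeat=len(transitions) + 1):
        p = init[config[0]]
        for i, table in enumerate(transitions, start=1):
            p *= table[config[i - 1]][config[i]]
        total -= xlogx(p)
    return total
```

The factorized version costs on the order of U times the table size, whereas the enumeration grows exponentially with the chain length U.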

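Similarly, computing the expected-likelihood terms that appear in the 
objective function $J(\lambda)$ is also straightforward for latent graphical models: 

$$\begin{aligned}
q_t^T q_{t'} &= \sum_m \prod_{i=1}^{U} q_t(m_i \mid \pi_{m_i})\, q_{t'}(m_i \mid \pi_{m_i}) \\
&= \sum_m \prod_{i=1}^{U} \Psi(m_i, \pi_{m_i}) .
\end{aligned}$$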



Here we have replaced the product of the pairs of conditionals with gen- 
eral clique potentials and only need to run the junction tree algorithm on 
the resulting undirected graph to compute its total. Once the algorithm 
settles (after collecting and distributing), the total in any clique equals 
the expected likelihood (since the above is not a normalized probability 
distribution). Similarly, we compute $\lambda_t^T \mathbf{1}$ by finding the normalizer of 
an undirected graphical model via the junction tree algorithm. 
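
For two Markov chains over the same variables, the junction tree computation reduces to a single forward pass over products of paired conditionals. The following sketch assumes that special case, with each chain stored as an initial distribution plus one table per later position; it is an illustration under those assumptions rather than a general junction tree implementation.

```python
from itertools import product


def expected_likelihood(chain_a, chain_b):
    """Compute sum_m q_a(m) * q_b(m) for two chains by sum-product."""
    init_a, trans_a = chain_a
    init_b, trans_b = chain_b
    n = len(init_a)
    # Unnormalized message: pairwise products play the role of potentials.
    message = [init_a[k] * init_b[k] for k in range(n)]
    for ta, tb in zip(trans_a, trans_b):
        message = [sum(message[j] * ta[j][k] * tb[j][k] for j in range(n))
                   for k in range(n)]
    return sum(message)


def expected_likelihood_brute_force(chain_a, chain_b):
    """Same inner product by enumerating every configuration."""
    def prob(chain, config):
        init, transitions = chain
        p = init[config[0]]
        for i, table in enumerate(transitions, start=1):
            p *= table[config[i - 1]][config[i]]
        return p

    n = len(chain_a[0])
    length = len(chain_a[1]) + 1
    return sum(prob(chain_a, c) * prob(chain_b, c)
               for c in product(range(n), repeat=length))
```

Because the paired potentials are not normalized, the final sum is the quantity of interest itself and not a probability that must equal one.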

Even computing the min and max over the configurations to determine 
valid step sizes in the sequential optimization remains computationally 
efficient. Recall that we need to check the following types of expressions 
over all latent configurations to ensure that Lagrange multipliers satisfy 
our box constraints : 



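$$u \ge u_{\min} = \max_m \left\{ \sum_{t \in T_n} \frac{\lambda_t(m)}{q_u(m)} - \sum_{t \in T_p} \lambda_t^T \mathbf{1}\, \frac{q_t(m)}{q_u(m)} \right\} .$$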



Here the terms being maximized over m also have a factorization struc- 
ture allowing us to use the junction tree algorithm with max operations 
instead of summations to find the largest entry. Similarly, we can use the 
junction tree algorithm with min operations to find the smallest entry 
to compute $u_{\max}$. 
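
As a simplified illustration of replacing sums by max operations, consider only the simplest ingredient of the expressions above: the largest ratio q_t(m)/q_u(m) between two Markov chains. The ratio factorizes into ratios of conditionals, so a Viterbi-style pass finds its maximum. The sketch assumes strictly positive tables in the denominator chain and does not handle the full expression, which combines several such terms.

```python
from itertools import product


def max_ratio(chain_num, chain_den):
    """max over configurations of q_num(m) / q_den(m) via max-product."""
    init_n, trans_n = chain_num
    init_d, trans_d = chain_den
    n = len(init_n)
    best = [init_n[k] / init_d[k] for k in range(n)]
    for tn, td in zip(trans_n, trans_d):
        # Non-negative factors let the max distribute over the product.
        best = [max(best[j] * tn[j][k] / td[j][k] for j in range(n))
                for k in range(n)]
    return max(best)


def max_ratio_brute_force(chain_num, chain_den):
    """Same maximum by enumerating every configuration."""
    def prob(chain, config):
        init, transitions = chain
        p = init[config[0]]
        for i, table in enumerate(transitions, start=1):
            p *= table[config[i - 1]][config[i]]
        return p

    n = len(chain_num[0])
    length = len(chain_num[1]) + 1
    return max(prob(chain_num, c) / prob(chain_den, c)
               for c in product(range(n), repeat=length))
```

Swapping `max` for `min` in both places gives the smallest ratio with the same cost.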

We thus see that we can discriminatively learn structured graphical 
models where many latent variables interact through complex yet factorized 
posterior distributions on latent variables. The Lagrange multipliers 
(for uninformative priors) inherit these factorizations and remain 
constrained and efficient to update when we deal with sequential optimization 
schemes. In fact, the Lagrange multipliers merely explore 
scaled versions of the posterior distributions on latent variables to attain 
their next best configuration. Lagrange multipliers can also be 
stored and manipulated efficiently as factorized but unnormalized distributions. 
All necessary computations including sequentially updating 
the sets of Lagrange multipliers, estimating the posterior over latent variables, 
computing entropies, computing expected likelihood, and limiting 
the Lagrange multipliers with box constraints remain efficient, since we 
can leverage the factorization properties of the graphical models. 

9. Mean Field for Intractable Models 

In the MED latent formalism, we may even consider intractable mod- 
els, whose latencies remain computationally infeasible despite the net- 
work’s factorization properties. This is the case when the graphical mod- 
els themselves are non-tree structures, such as factorial hidden Markov 
models [60] or loopy graphs like Markov random fields [202, 51]. Such 
intractable models have similar difficulties with maximum likelihood es- 
timation and traditional expectation-maximization frameworks as well. 
For those models mean field and structured mean field methods are often 
employed to make maximum likelihood learning tractable [75, 81, 161, 
134, 58]. Similar tools may be useful in the latent MED technique. 

Recall when we invoked Jensen’s inequality to bound the latent MED 
constraints. This resulted in the computation of the distributions over 
latent configurations Q and Q. However, these distributions may be too 
large for intractable models and have too many configurations, despite 
factorization properties. Thus, storing the exact posteriors Q or Q, 
even as conditional probability tables, is impossible. It is possible to 
instead compute Q and Q distributions that are forced to factorize even 
further. This will no longer provide the tightest possible Jensen bound 
and will further shrink the convex hull of constraints from the original V 
admissible set, but we will still have a guaranteed Jensen bound and can 
iterate the latent MED framework. The advantage, however, in having 
a more factorized Q and Q is that the Lagrange multiplier vectors above 
also become tractable, storable and efficient to estimate in the sequential 
framework. We do not have to consider full factorization in a mean-field 
sense, but can consider structured mean-field factorizations. Here the 
latent distributions are not fully factorized but rather still have tractable 
substructures such as trees and chains without loops. The resulting Q 
and Q distributions will then be estimated by minimizing Kullback- 
Leibler divergence to the true expectations of the latent variables under 
the current mixture model distribution. More details about obtaining 
mean-field and structured mean-field bounds on intractable models can 
be found in [75, 81, 161, 134, 58]. 




Chapter 6 



CONCLUSION 



A mathematical theory is not to be considered complete until you have 
made it so clear that you can explain it to the first man whom you 
meet on the street. 

David Hilbert, 1862-1943 



This text has motivated and situated two schools of thought: gen- 
erative and discriminative learning. Both have deeply complementary 
advantages yet, in their traditional incarnations, have been incompati- 
ble. We started by reviewing several approaches in each school. This in- 
cluded Bayesian methods, maximum likelihood, exponential family mod- 
els, maximum entropy, expectation-maximization and graphical models 
in the generative school. In the discriminative school, we discussed con- 
ditional likelihood, logistic regression, support vector machines and ker- 
nel methods. The various strengths and weaknesses of the methods 
suggested that a hybrid framework could be quite beneficial. This led us 
to a common mathematical framework that unites the two and marries 
their strengths. This framework of maximum entropy discrimination al- 
lowed us to connect maximum entropy with discriminative margin-based 
constraints. It spanned many important generative models allowing us 
to learn their parameters discriminatively. Other extensions were feasi- 
ble beyond binary classification and an important iterative formulation 
for latent variables also emerged. MED thus provided a principled fusion 
of discriminative and generative learning. We can now consider using 
the flexible space of generative models while maximizing their perfor- 
mance on the tasks at hand. Thus probabilistic modeling resources are 
harnessed optimally by a discriminative criterion avoiding the intermediate 
sub-goal of learning a good generator. The end result is better 
performance with the same rich models. 

We now review some of the tools that were covered and also point out 
ways to design and modify them for specific machine learning applica- 
tions. The last few paragraphs in this chapter then outline directions for 
future research in the intersection between generative and discriminative 
learning. 

1. A Generative and Discriminative Hybrid 

We looked at a spectrum of generative and discriminative tools in 
the machine learning community. One way to visualize them is on a grid 
or road map between the generative and discriminative tools, which includes 
conditional learning as a half-way point. The grid also shows 
how each of these schools can also accommodate local, regularized (or 
endowed with priors) and averaged solutions. Generative approaches 
were typically characterized by elaborate models. They span many dis- 
tributions, unusual data, nonlinearities, potential latencies, graphical 
network structures and so on. Yet they could perform poorly on some 
tasks when estimated with non-discriminative criteria such as maximum 
likelihood. The discriminative approaches were characterized by more 
robust, discriminative, large-margin learning algorithms yet only have 
a limited portfolio of kernels and feature mappings to explore unusual 
data and nonlinearities. 

The maximum entropy discrimination method connects the above two 
schools by first starting with a discriminative SVM-like framework (more 
specifically, a regularization theory framework which is a closely re- 
lated cousin of the SVM). The method then injects a Bayesian flavor 
to the discrimination problem by recasting the solution as a probability 
distribution over the space of all classifiers. This is done while main- 
taining classification and discrimination constraints on the margins to 
discriminatively focus this probabilistic solution on the task at hand. 
This shows that MED is an averaging approach. The solution, remark- 
ably, was unique and readily obtainable by classical maximum entropy 
methods. The MED framework also gave rise to many elegant flexibil- 
ities. Through its augmented distributions, which included potentially 
infinite-dimensional priors and posteriors on models, margins, biases and 
other terms, we could tailor the learning problem at hand easily and in- 
troduce prior knowledge directly. Furthermore, the framework handles 
a wide range of distributions and nonlinearities while still maintaining 
a convex and unique solution. It enjoys an intuitive geometric inter- 
pretation allowing us to view various MED discrimination problems as 
information projections. While the framework does give rise to support 
vector machines and quadratic programming when Gaussian assump- 
tions are made, we were free to consider many other distributions. In 
fact, all exponential family distributions (multinomials, full covariance 
Gaussians, and so forth) can be handled in closed form and give rise 
to general convex programs. Thus, MED combines the flexibility of 
Bayesian modeling with discriminative estimation. Algorithms for esti- 
mating the exponential family, support vector machines, Gaussian mod- 
els and multinomial models were portrayed. Empirical discrimination 
results argued that MED is more appropriate for computing distribu- 
tion parameters than traditional generative estimation criteria. Finally, 
the new framework accommodates several generalization guarantees in- 
cluding those borrowed from standard SVM arguments (VC dimension 
and sparsity) as well as PAC-Bayes bounds for averaged classifiers. 

Various extensions were easily attainable and straightforward to in- 
sert into the probabilistic MED framework. These extensions often came 
forth via the metaphor of augmented distributions which permitted us to 
cascade various estimation problems into the learning task elegantly. For 
instance, feature selection was immediately attainable from the point of 
view of a Bernoulli distribution on the features. These favor extinguish- 
ing some features resulting in a sparse support vector machine solution 
which may have otherwise been difficult to construct without a prob- 
abilistic framework. Multi-class classification was naturally accommo- 
dated via discriminant functions as log-ratios between generative models. 
We also extended MED beyond the classification domain to the regression 
domain, subsumed the support vector regression method and elaborated 
regression with generative models. Transduction in a classification 
and regression setting was also discussed. Performing kernel selection, 
an important aspect of support vector machine methods, was natural 
with the framework via probabilistic switches on kernels much like fea- 
ture selection. Furthermore, meta-learning and multi-task issues (which 
are still only rarely studied in SVM learning) were easily addressed by 
considering common kernel and feature selection switch configurations 
for multiple support vector machines. 

We then discussed how to extend discriminative approaches to latent 
domains. We first recognized the predominance of mixtures of the ex- 
ponential family in the domain of generative learning. Then we noted 
the potential intractabilities that arise with such latent models. Varia- 
tional bounds were motivated as an efficient and principled way to re- 
solve such intractabilities. A bound-based discriminative variant of the 
expectation-maximization was proposed which iteratively restricted the 
convex hull of constraints and repeated projections in the MED framework. 
The bounds rendered latent MED computations tractable. The 
iterative latent MED formulation was also applicable to structured latent 
graphical models permitting them to be estimated discriminatively. 
We demonstrated that efficient computation of the bounds is possible 
by noting that the large number of Lagrange multipliers in the latent 
MED formulation inherit the factorization properties of the posterior 
on latent variables. Thus MED discriminative learning is applicable to 
latent Bayesian networks in general and spans a large portion of today’s 
sophisticated generative models. 

2. Designing Models versus Designing Kernels 

We should point out that discriminative approaches can indeed accom- 
modate some domain-specific knowledge yet this approach rarely has the 
ease of use that generative model design exhibits. In support vector ma- 
chines and their variants, for example, prior knowledge and extensions to 
strange types of input data are tackled via kernel engineering and feature 
engineering. While it may sometimes be natural to think about a prob- 
lem domain by noting a mapping to high dimensional Hilbert space or 
by discovering a new kernel, such an approach is not always as practical 
as generative model design. Kernel design is also often not as visualiz- 
able as generative modeling. Nevertheless, several general guidelines and 
frameworks exist for designing and optimizing kernels and deserve some 
mention. For instance, one may consider exploring a restricted class of 
kernels with certain computational properties such as convolutional ker- 
nels [66, 33]. Other examples include the large class of string kernels for 
dealing with sequential data (primarily over discrete symbol alphabets) 
as in the following efforts [114, 112, 191, 189]. Some kernels may be 
explained in terms of more general modeling tools such as transducers 
using so-called rational kernels [34] which are able to modularly combine 
kernels. It may even be possible to construct and compose kernels via 
combinations of super-kernels or hyper-kernels allowing us to explore a 
space of mappings to find the appropriate Hilbert space for our learning 
problem [138]. Unfortunately, these kernel design approaches do not 
always address or leverage the generative modeling literature. There 
may potentially be much to gain by building upon generative model- 
ing, HMM-variants and statistical tools to facilitate the kernel design 
process. 

Some specific efforts have been investigated for building kernels over 
probability models and generative models. These include the Fisher ker- 
nel which forms a generative model of the aggregated data set. It then 
approximates a kernel from the resulting statistical manifold by locally 
linearizing the Kullback-Leibler divergence around the maximum likelihood 
estimate [78]. Alternatively, information diffusion gives kernels by 
solving heat equations on a statistical manifold over a given generative 
model’s parameter space to approximate the geodesic on the statistical 
manifold yet this approach is limited in which generative models it can 
handle [107]. Expected-likelihood and Bhattacharyya kernels are also 
probabilistic and arise from the probability dot product between two 
generative models [87]. These kernels do apply to a wide range of gen- 
erative models. However, in all these probabilistic kernel approaches, 
generative models themselves are estimated with maximum likelihood 
or another generative criterion prior to being used as kernels. This max- 
imum likelihood step may act as a bottleneck, reducing their ultimate 
discrimination power. Therefore, the piece-meal modular approaches for 
combining generative models into support vector machines may not fully 
leverage discrimination into the generative model. 
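
As a small illustration of the probability dot product idea for discrete distributions, the sketch below evaluates the sum over outcomes of (p(x) q(x))^rho. With rho = 1 this is the expected-likelihood kernel and with rho = 1/2 it is the Bhattacharyya kernel. The two distributions are taken as given here; in practice each would first be estimated from data, which is the generative step discussed above. The function name is hypothetical.

```python
def probability_product_kernel(p, q, rho=1.0):
    """Kernel between two discrete distributions given as probability lists.

    rho = 1.0 gives the expected-likelihood kernel; rho = 0.5 gives the
    Bhattacharyya kernel, which equals 1 when p and q are identical.
    """
    if len(p) != len(q):
        raise ValueError("distributions must share the same support size")
    if rho <= 0.0:
        raise ValueError("rho must be positive")
    return sum((pi * qi) ** rho for pi, qi in zip(p, q))
```

The kernel is symmetric in its two arguments by construction, since each term depends only on the product p(x) q(x).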

A direct approach such as maximum entropy discrimination does ex- 
plore the wide range of generative models and exploits discrimination to 
estimate all their parameters from the outset. Thus, the intermediate 
and sometimes unreliable subgoal of generative estimation is avoided. 
One important and enduring advantage of generative models is the ease 
by which they can be designed and adjusted to capture knowledge about 
a particular learning problem or to capture expertise about a domain. 
This is in contrast to what may be an awkward approach of kernel engineering 
that is predominant in support vector machines and discriminative 
learning schools. Combining generative models with discriminative 
frameworks as in MED preserves the ease of design generative model- 
ing brings to bear on machine learning problems. It is easy to visualize 
and modify the dependencies between random variables through graph- 
ical modeling tools and to consider variations such as latent variables 
by introducing hidden nodes. Furthermore, certain variables often have 
natural choices for distributions. For instance, distributions for con- 
tinuous vector variables are typically Gaussian, discrete variables have 
multinomial distributions and non-negative count variables are often modeled
by Poisson distributions. Temporal and sequential data are typically
described by Markov independence assumptions. These standard choices
for distributions make it easy for the designer to build a generative 
model. Generative modeling is also more practical when we handle ex- 
tremely large problem domains with thousands of variables since the 
graphical modeling machinery permits us to consider complex Bayesian 
networks and, possibly more importantly, visualize them. It is also po- 
tentially easier to communicate such models to other scientists, offering 
a bridge between machine learning researchers and researchers in
applied domains. Finally, the MED framework makes it easy to import
or borrow generative models from the large body of maximum
likelihood methods into discriminative settings with minimal restructuring.
This allows one to take advantage of the large body of work in many 
applied domains where classical maximum likelihood efforts have been 
extensively investigated and generative probabilistic models have been 
refined over many iterations and generations. 

3. What’s Next? 

Having formed a joint generative-discriminative framework, it is now 
important to explore the continuum between the generative and
discriminative solutions it can produce. As mentioned in Chapter 3, through a
regularization parameter the MED framework can interpolate between
a purely generative empirical Bayes model and a purely discriminative
solution. What intuitions can be garnered about the appropriate level
of regularization? Beyond regularization parameters, what other pa- 
rameters in the framework (such as epsilon-insensitivity in regression, 
number of latent models, etc.) might have principled settings or may 
be estimable without brute-force cross-validation? Furthermore, what 
intuitions can we form about which models are most amenable to (and 
most likely to benefit from) discriminative estimation? 

Another immediate problem is the presence of local minima in the it- 
erative latent MED framework. While MED effectively eschews the local 
minima problem for exponential family models and promises interesting 
convergence properties, global or pseudo-global solutions may be within 
reach for latent situations as well. Conventional deterministic annealing 
or regularization arguments are certainly possible avenues [185]. How- 
ever, a formal treatment of latent MED that leverages both latent mod- 
eling and discrimination while maintaining convexity would ease such 
problems. Nevertheless, local minima in latent models have plagued 
almost all frameworks, including maximum likelihood and variational 
Bayesian methods, so a solution here may lie in reformulating
or relaxing the latent models themselves [85].

The framework facilitates many important extensions which demon- 
strate and prove its flexibility. While transduction, feature selection, 
latent models, and so forth have been explored, these may only be the 
proverbial tip of the iceberg. Many other interesting learning scenarios 
might await. For instance, missing or corrupted data in the input space 
may be addressed with an appropriate prior and an augmented MED 
projection. We may also consider invariants and transformations in our 
data and handle those more explicitly [169, 54, 31, 85]. Alternatively we 
may consider choosing other distributions for model priors, margin
priors, bias priors, etc. to explore the effects these would have on the MED
solutions. The proposed discriminative-generative framework permits
us to explore novel probabilistic models that would not necessarily be 
practical in a purely generative setting. This may include, for instance, 
unnormalized generative models or energy based models [177]. 

The generative and discriminative estimation framework has many
powerful application areas that it could impact. One critical area is
in the speech recognition domain. There, recent results have indicated 
that discriminative HMMs outperform many methods on difficult large- 
corpus recognition tasks. The discriminative variants used in the speech 
recognition community are often more heuristic, relying on approximate 
bounds and local or gradient based optimization. Furthermore, these 
HMMs are often learned with fundamentally conditional criteria and 
not necessarily adjusted for a large-margin decision boundary. The ma- 
chinery in this text appears well suited to tackle these problems given 
that we were able to handle latent models, mixture models and hidden 
Markov models in MED. In fact, the list of machine learning application 
domains ranging from bioinformatics to computer vision to information 
retrieval is simply too long to enumerate. Fortunately, this provides 
an endless array of challenging problems to explore and many potential 
clients for discriminative-generative learning.

This appendix provides some additional implementation details for
optimization methods under the MED framework.

1. Optimization in the MED Framework 

At this point, we discuss some implementation details of the
optimization of the $J(\lambda)$ objective function in the MED framework. An
important advantage is that $J(\lambda)$ is concave and therefore any procedure
that locally increases it will eventually converge to the global optimum.
This will consistently give us the best setting of the Lagrange multipliers 
in the dual space optimization. Since consistent global convergence is 
guaranteed, we will instead focus on the speed of convergence and dis- 
cuss multiple algorithms. Some natural optimization techniques in this 
setting include Newton-Raphson, gradient descent, line search and con- 
jugate gradient descent [149]. Unfortunately, these do not always take 
advantage of the simple yet important decoupling properties in the ob- 
jective function. This limitation is portrayed initially in our presentation 
of a simple constrained gradient descent approach. This then motivates 
the use of faster axis-parallel approaches which benefit from the decou- 
pling of the objective function and only require local computations. We 
finally propose an optimized variant of the axis-parallel method which 
learns how to transition between subsets of the variables to speed up the 
training process. 

1.1 Constrained Gradient Ascent 

One possible approach to maximizing $J(\lambda)$ is to compute the gradients
with respect to the Lagrange multipliers and to take a small step in their
direction:

$$\lambda^{+} = \lambda^{-} + w \left. \frac{\partial J(\lambda)}{\partial \lambda} \right|_{\lambda^{-}}$$

where $\lambda^{+}$ are the next values of the Lagrange multipliers, $\lambda^{-}$ are the
previous ones and $w$ denotes the step size. However, this form of
optimization will disregard the constraints that the $\lambda$ are non-negative. This
can be taken care of by reparameterizing them as follows:

$$\nu_t^2 = \lambda_t .$$

We can use the surrogate variables $\nu$ which, when squared, form the $\lambda$
vector. Therefore, we have:

$$\nu^{+} = \nu^{-} + w \frac{\partial J(\lambda)}{\partial \nu} .$$

This maintains the non-negativity constraint on the $\lambda$ vector as we
perform gradient ascent. However, in problems where a non-informative
bias is used, we also have the additional constraint $\sum_t \lambda_t y_t = 0$. This
can be resolved by projecting each step in the unconstrained gradient
ascent back onto the plane $\sum_t \lambda_t y_t = 0$. However, since we are
operating in $\nu$-space, this planar constraint behaves as a quadratic constraint:
$\sum_t \nu_t^2 y_t = 0$. Nevertheless, this projection is still solvable analytically.




Figure 7.1. Constrained Gradient Ascent Optimization in the MED framework. 

In addition, to speed up the convergence, we allow the step size $w$
to vary with each iteration. If the step results in an increase in the
objective function $J(\lambda)$, then we take the step and also slightly increase
$w$. If it doesn't result in an increase, we do not take the step and retry
the gradient step with $w$ scaled down by half.

In practice, computing the gradients and updating the $J(\lambda)$ function
is slow for many MED problems. This, compounded with the fact that
we are constantly re-projecting onto the constraint surface, leads to poor
convergence. One way to vastly improve the optimization process is to
only consider updating a single $\lambda_t$ variable at a time and only computing
$J(\lambda)$ after that single perturbation. In many problems, this permits us to
decouple the computations and effectively consider only a single datum
at a time, speeding up each iteration considerably (often by a factor
equal to the cardinality of the data set, $T$). This approach is elaborated
in the following subsections.
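
The squared-surrogate reparameterization and the adaptive step size can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the function name, the starting point, the growth factor of 1.1 and the toy objective are all hypothetical, and the equality constraint arising from a non-informative bias is not handled.

```python
import numpy as np

def constrained_gradient_ascent(J, grad_J, num_vars, w=0.1, iters=2000):
    """Maximize J(lam) subject to lam >= 0 via lam = nu**2 and an adaptive step size w."""
    nu = np.ones(num_vars)                     # arbitrary starting point (assumption)
    best = J(nu ** 2)
    for _ in range(iters):
        grad_nu = 2.0 * nu * grad_J(nu ** 2)   # chain rule: dJ/dnu_t = 2 nu_t dJ/dlam_t
        candidate = nu + w * grad_nu
        value = J(candidate ** 2)
        if value > best:                       # improvement: accept and grow the step
            nu, best = candidate, value
            w *= 1.1                           # "slightly increase" (factor is an assumption)
        else:                                  # no improvement: reject and halve the step
            w *= 0.5
    return nu ** 2, best

# Toy concave objective (hypothetical, not an MED dual): J = -0.5 * ||lam - m||^2
m = np.array([1.0, -2.0, 0.5])
toy_J = lambda lam: -0.5 * np.sum((lam - m) ** 2)
toy_grad = lambda lam: -(lam - m)
lam_opt, J_opt = constrained_gradient_ascent(toy_J, toy_grad, num_vars=3)
```

In this toy problem the second coordinate of the unconstrained maximizer is negative, so the reparameterization drives it toward zero while the other two coordinates reach their unconstrained values.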

1.2 Axis-Parallel Optimization 

As discussed in the previous subsection, gradient ascent types of up- 
dates may not be efficient in the MED framework since each step requires 
computations of gradients and the new objective function over all the 
training data set. However, if we only consider updating a single La- 
grange multiplier at a time, the computations only involve manipulation 
of a single data point in detail as well as some simple sufficient statistics 
that summarize the effect of the rest of the dataset. Axis-parallel
optimization is similar to the notion of smallest possible working sets in
[139] and Platt’s sequential minimal optimization [147]. The difference 
here is that the working set is a single variable and we only optimize one 
dimension while all others are fixed. 




Figure 7.2. Axis-Parallel Optimization in the MED framework. 

Axis-parallel optimization has been around for a while and has its 
advantages and disadvantages. Other than computational efficiency, an 
additional advantage in MED is due to the overall concavity of the
objective function. Thus, optimizing over a single variable at a time is
guaranteed not to decrease the objective, and iterating these axis
optimizations will eventually converge to the global optimum. Figure 7.2 depicts
the optimization in a toy 2D problem. 

In certain cases, the update for a single axis can be computed
analytically. Take for example the MED SVM kernel-based classification
with an (informative) Gaussian prior on the bias. Thus, the additional
constraint $\sum_t y_t \lambda_t = 0$ (as shown in Chapter 3) is obviated and we have the
following:

$$J(\lambda) = \sum_t \left[ \lambda_t + \log(1 - \lambda_t / c) \right] - \frac{\sigma^2}{2} \Big( \sum_t y_t \lambda_t \Big)^2 - \frac{1}{2} \sum_{t,t'} \lambda_t \lambda_{t'} y_t y_{t'} K(X_t, X_{t'}) .$$

The only constraints in effect in the objective function above are that
the Lagrange multipliers are non-negative and upper bounded by $c$. A
simple analytic update rule exists for maximizing one Lagrange
multiplier, $\lambda_t$, at a time. Holding all other Lagrange multipliers fixed,
taking the derivative with respect to $\lambda_t$ and setting it to zero yields a
quadratic equation

$$A \lambda_t^2 + B \lambda_t + C = 0$$

where we have the following scalars to specify the quadratic equation:

$$A = K(X_t, X_t) + \sigma^2$$

$$B = -1 - c\sigma^2 - c K(X_t, X_t) + y_t \sum_{t' \neq t} K(X_t, X_{t'}) y_{t'} \lambda_{t'} + \sigma^2 y_t \sum_{t' \neq t} y_{t'} \lambda_{t'}$$

$$C = -1 + c - c\, y_t \sum_{t' \neq t} K(X_t, X_{t'}) y_{t'} \lambda_{t'} - c \sigma^2 y_t \sum_{t' \neq t} y_{t'} \lambda_{t'} .$$
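
For readers who wish to check the coefficients, the quadratic can be recovered in a few lines. The derivation below assumes the objective has the form stated in its first line; the shorthand $a$ and $b$ is introduced only for this sketch, and $y_t^2 = 1$ is used throughout.

```latex
\begin{align*}
J(\lambda) &= \sum_t \left[ \lambda_t + \log(1 - \lambda_t/c) \right]
  - \frac{\sigma^2}{2} \Big( \sum_t y_t \lambda_t \Big)^2
  - \frac{1}{2} \sum_{t,t'} \lambda_t \lambda_{t'} y_t y_{t'} K(X_t, X_{t'}) \\
\frac{\partial J}{\partial \lambda_t}
  &= 1 - \frac{1}{c - \lambda_t} - a\,\lambda_t - b = 0,
  \qquad a = K(X_t, X_t) + \sigma^2, \quad
  b = y_t \sum_{t' \neq t} \big( K(X_t, X_{t'}) + \sigma^2 \big) y_{t'} \lambda_{t'} \\
0 &= (c - \lambda_t)(1 - a\,\lambda_t - b) - 1
   = a\,\lambda_t^2 + (-1 - c\,a + b)\,\lambda_t + (-1 + c - c\,b)
\end{align*}
```

The quadratic on the last line evaluates to $-1$ at $\lambda_t = c$, so whenever $a > 0$ the value $c$ lies strictly between the two roots: one root falls below $c$ and the other above it.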

These two solutions to the quadratic equation are clamped so that
$\lambda_t \in [0, c)$ and are then evaluated to see which one causes the
greatest increase in the objective function. In other cases, it is difficult to
obtain an analytic update rule for a single Lagrange multiplier as above.
We then use Brent's method [149], a guaranteed 1D search method
which is more efficient than bisection search. This gives the maximum
of the objective function for the single Lagrange multiplier numerically.
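
The analytic single-axis update can be sketched as follows. This is a hedged illustration rather than the original code: the function names are hypothetical, the objective is assumed to be the MED SVM dual with a Gaussian bias prior of variance `sigma2` and margin parameter `c`, and the coefficient signs are those obtained by differentiating that assumed objective. The small offset used to keep the multiplier strictly below `c` is also an assumption.

```python
import numpy as np

def med_objective(lam, y, K, c, sigma2):
    """Assumed MED SVM dual with a Gaussian prior (variance sigma2) on the bias."""
    if np.any(lam < 0.0) or np.any(lam >= c):
        return -np.inf
    yl = y * lam
    return (np.sum(lam + np.log1p(-lam / c))
            - 0.5 * sigma2 * np.sum(yl) ** 2
            - 0.5 * yl @ K @ yl)

def axis_update(lam, t, y, K, c, sigma2):
    """Maximize the objective over lam[t] alone, holding the other multipliers fixed."""
    others = np.arange(len(lam)) != t
    G = np.sum(K[t, others] * y[others] * lam[others])   # kernel-weighted sum over t' != t
    S = np.sum(y[others] * lam[others])                   # plain sum over t' != t
    A = K[t, t] + sigma2
    B = -1.0 - c * sigma2 - c * K[t, t] + y[t] * G + sigma2 * y[t] * S
    C = -1.0 + c - c * y[t] * G - c * sigma2 * y[t] * S
    disc = np.sqrt(B * B - 4.0 * A * C)
    upper = c * (1.0 - 1e-9)                              # stay strictly inside [0, c)
    best_lam, best_val = None, -np.inf
    for root in ((-B - disc) / (2.0 * A), (-B + disc) / (2.0 * A)):
        trial = lam.copy()
        trial[t] = min(max(root, 0.0), upper)             # clamp the root
        val = med_objective(trial, y, K, c, sigma2)
        if val > best_val:                                # keep the better clamped root
            best_lam, best_val = trial, val
    return best_lam, best_val
```

Only row `t` of the kernel matrix enters the coefficients, which is the decoupling that makes each axis update cheap; the full objective is re-evaluated here purely for clarity, and an incremental update of the two sums would avoid that cost.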

At this point, we will focus on how to choose the axes intelligently in
the axis-parallel optimization. Typically, in axis-parallel optimization,
we iterate by randomly selecting one axis from the $T$ possible choices
(if the optimization of $J(\lambda)$ is $T$-dimensional). Eventually, the objective
converges and we cease optimizing with a simple heuristic stopping
criterion. Optimization is often fast and MED classification with hundreds
of data points takes just a few seconds. We next discuss a more efficient
strategy than random selection which can bring convergence
improvements of about an order of magnitude (which can be important in large
data set problems or, for instance, in high-dimensional feature selection
problems).
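
A baseline loop with random axis selection might look like the sketch below. Several choices here are assumptions made for illustration: axes are visited in a freshly shuffled order on each sweep, a golden-section search stands in for Brent's method as the 1D maximizer, the stopping rule is a simple threshold on the total gain per sweep, and the objective is a toy concave quadratic rather than an MED dual.

```python
import math
import random

def golden_section_max(f, lo, hi, tol=1e-10):
    """1D maximizer for a unimodal f on [lo, hi] (stand-in for Brent's method)."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        x1 = b - inv_phi * (b - a)
        x2 = a + inv_phi * (b - a)
        if f(x1) < f(x2):
            a = x1          # maximum lies in [x1, b]
        else:
            b = x2          # maximum lies in [a, x2]
    return 0.5 * (a + b)

def random_axis_ascent(J, lam, upper, sweeps=200, tol=1e-12, seed=0):
    """Axis-parallel ascent over the box [0, upper]^T with randomly ordered axes."""
    rng = random.Random(seed)
    lam = list(lam)
    value = J(lam)
    axes = list(range(len(lam)))
    for _ in range(sweeps):
        rng.shuffle(axes)                       # random axis order for this sweep
        gain = 0.0
        for t in axes:
            along_axis = lambda x, t=t: J(lam[:t] + [x] + lam[t + 1:])
            x_new = golden_section_max(along_axis, 0.0, upper)
            new_value = along_axis(x_new)
            if new_value > value:               # only accept genuine improvements
                gain += new_value - value
                lam[t], value = x_new, new_value
        if gain < tol:                          # heuristic stopping criterion
            break
    return lam, value

# Toy concave objective (hypothetical): maximized at lam = (1/3, 1/3)
toy = lambda lam: lam[0] + lam[1] - (lam[0] ** 2 + lam[0] * lam[1] + lam[1] ** 2)
lam_star, J_star = random_axis_ascent(toy, [0.0, 0.0], upper=1.0)
```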

1.3 Learning Axis Transitions 

While the previous approach of randomly selecting an axis and
maximizing it in isolation does produce a fast learning algorithm for MED, it
can be made significantly faster by a smarter routine for axis selection.
One strategy is to learn which axes are critical for producing a large
improvement in our objective function $J(\lambda)$. This can be seen as a
first-order table or model which puts a scalar weight on each axis, measuring
its expected contribution to the increase in the objective function. We
could thus sample from this table as a distribution and update axes that
are crucial to increasing $J(\lambda)$ more frequently than irrelevant axes. A
natural extension to this $T$-element table is to consider a $T \times T$ matrix
where columns correspond to the last axis that was optimized and the
rows correspond to the next axis to optimize. Each column of this matrix
thus specifies a distribution over the choice of axes for the next iteration
given the last candidate that was attempted. Effectively, this mimics a
Markov transition matrix over axes. By identifying which axes are good
followers of the current axis, we can sample more efficiently from our list
to get a greater expected improvement in the objective function.

More specifically, we compute the improvement in the objective
function brought about by an axis optimization as $\Delta J(\lambda)$ as we go from an
old value to a new one on an axis. Needless to say, all values of $\Delta J(\lambda)$
are non-negative (since each axis-parallel step is guaranteed not to decrease
the objective). An additional problem is that the $\Delta J$ values must be
discounted since we expect large gains at the early stages followed by
exponentially reducing gains in $J$ as we near convergence. Therefore,
we model the change of the time-varying values $\Delta J_\tau$ as they arrive in an
online manner over time. This is done by fitting an exponential model
to the values

$$\Delta J_\tau \approx \alpha \exp(-\beta \tau) .$$

This parameterized curve is fit to data with a simple least squares
criterion in an online way (i.e. we don't need to explicitly store the
values of $\Delta J_\tau$). Figure 7.3 shows the fitting procedure to some values
of $\Delta J$. Thus, we can now adjust the values of the $\Delta J$ to obtain values
which are appropriately discounted

$$\Delta \hat{J}_\tau = \Delta J_\tau - \alpha \exp(-\beta \tau) .$$
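
One way to realize such an online fit is sketched below. It assumes the least squares criterion is applied in log space, so that log ΔJ is regressed linearly on the iteration index τ; the text does not specify this detail, and the class and method names are hypothetical. Only five running sums are kept, so the individual increments never need to be stored.

```python
import math

class OnlineExpFit:
    """Running least-squares fit of log(delta_J) = log(alpha) - beta * tau."""

    def __init__(self):
        self.n = 0
        self.s_t = self.s_y = self.s_tt = self.s_ty = 0.0

    def update(self, tau, delta_j):
        if delta_j <= 0.0:            # log undefined; skip zero gains (assumption)
            return
        y = math.log(delta_j)
        self.n += 1
        self.s_t += tau
        self.s_y += y
        self.s_tt += tau * tau
        self.s_ty += tau * y

    def params(self):
        """Return (alpha, beta), or None until two distinct iterations were seen."""
        denom = self.n * self.s_tt - self.s_t ** 2
        if self.n < 2 or denom == 0.0:
            return None
        slope = (self.n * self.s_ty - self.s_t * self.s_y) / denom
        intercept = (self.s_y - slope * self.s_t) / self.n
        return math.exp(intercept), -slope

    def discounted(self, tau, delta_j):
        """Increment with the fitted decay trend subtracted."""
        fitted = self.params()
        if fitted is None:
            return delta_j
        alpha, beta = fitted
        return delta_j - alpha * math.exp(-beta * tau)

# Usage on synthetic, exactly exponential increments (illustrative values only)
fit = OnlineExpFit()
for tau in range(1, 41):
    fit.update(tau, 3.0 * math.exp(-0.05 * tau))
alpha_hat, beta_hat = fit.params()
```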








Figure 7.3. Approximating the decay rate in the change of the objective function. 

This can now be seen as the current true benefit of a given axis choice.
In a greedy strategy, we pick the axis that generated the largest discounted
value $\Delta \hat{J}_\tau$ from our current axis iteration. Thus, we can form a table
of the $\Delta \hat{J}_\tau$ with
the expected value of an axis optimization given a current axis. At
each iteration we select the axis which (given our current axis) has the
highest value of $\Delta \hat{J}_\tau$. We also still interleave random axis selections
about 20% of the time to encourage exploration to fill up our table of
axis-axis discounted objective function increments. In practice, we need
not store all $T \times T$ axis-axis transition values but only the handful of
transitions with the highest discounted $\Delta J$ values. Figure 7.4 depicts the
approximately 10-fold increase in optimization speed that results from
this axis choice strategy (here an MED linear regression with feature
selection problem is depicted).
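
The selection rule described above can be sketched with a sparse table. The class below is a hypothetical illustration: the name `AxisSelector`, the `top_k` pruning size and the choice to store only the most recent discounted gain per transition are assumptions, while the 20% exploration rate follows the text.

```python
import random

class AxisSelector:
    """Greedy choice of the next axis from a sparse axis-to-axis gain table."""

    def __init__(self, num_axes, explore=0.2, top_k=5, seed=0):
        self.num_axes = num_axes
        self.explore = explore        # fraction of purely random selections
        self.top_k = top_k            # transitions kept per previous axis
        self.rng = random.Random(seed)
        self.table = {}               # prev axis -> {next axis: discounted gain}

    def record(self, prev_axis, next_axis, discounted_gain):
        row = self.table.setdefault(prev_axis, {})
        row[next_axis] = discounted_gain
        if len(row) > self.top_k:     # keep only the strongest transitions
            del row[min(row, key=row.get)]

    def next_axis(self, prev_axis):
        row = self.table.get(prev_axis)
        if not row or self.rng.random() < self.explore:
            return self.rng.randrange(self.num_axes)   # explore
        return max(row, key=row.get)                   # exploit best known follower

# Usage: after optimizing axis `nxt` right after axis `prev`, store its discounted gain
selector = AxisSelector(num_axes=10)
selector.record(0, 3, 1.0)
selector.record(0, 1, 0.5)
candidate = selector.next_axis(0)
```

Keeping one small dictionary per previous axis means memory grows with the number of stored transitions rather than with the full $T \times T$ table.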




Figure 7.4. Axis-Parallel MED Maximization with Learned Axis Transition (solid
line) and Random Transition (dashed line).